vivek-balakrishnan-rovio commented on issue #9780: URL: https://github.com/apache/druid/issues/9780#issuecomment-831752144
Hello, I am a Data Engineer working for [Rovio](https://www.rovio.com/). We have been using Druid for close to 1.5 years to power our analytics dashboards. We had a similar requirement to export our data from Hive tables to Druid. Initially, we patched Hive's DruidStorageHandler to work with EMR and S3. Later, we decided to write an [Apache Spark write-only datasource](https://github.com/rovio/rovio-ingest) inspired by Hive's DruidStorageHandler. At that point we were not aware of any similar effort going on.

We notice that this PR has a richer feature set than our library and also supports both read & write, even for complex metrics (Sketches & HLL). Our library has been driven mostly by our internal needs: it only supports writing and, at the moment, only basic metric aggregations. Regarding writing, we took a similar approach to the one in this PR for partitioning the dataset before writing. We also wrote a [Scala extension](https://github.com/rovio/rovio-ingest/blob/main/src/main/scala/com/rovio/ingest/extensions/DruidDatasetExtensions.scala) and a [Python wrapper](https://github.com/rovio/rovio-ingest/blob/main/python/rovio_ingest/extensions/dataframe_extension.py) to abstract the partitioning logic. Unlike this PR, our library expects the data to be partitioned by the `__time` column and does not do any segment rationalizing in the commit phase. If you would like to try our library, the modules are available in Maven Central and PyPI (as documented in the README).

We are excited that there is an **official** library for reading and writing Druid segments with Spark dataframes. Regarding the feature gap between your extension and rovio-ingest, how do you envision the future? Would you be open to, for example, adding a PySpark wrapper?

By the way, rovio-ingest has already been added to the list of Community and Third Party Software on the Druid website: https://github.com/apache/druid-website-src/pull/231

Thank you.
