vivek-balakrishnan-rovio commented on issue #9780:
URL: https://github.com/apache/druid/issues/9780#issuecomment-831752144


   Hello,
   I am a Data Engineer working for [Rovio](https://www.rovio.com/).
   We have been using Druid for close to 1.5 years to power our analytics 
dashboards.
   We had a similar requirement to export our data from Hive tables to Druid.
   
   Initially, we had patched the DruidStorageHandler of Hive to work with EMR 
and S3.
   Later, we decided to write an [Apache Spark write-only 
datasource](https://github.com/rovio/rovio-ingest) inspired by Hive's 
DruidStorageHandler. At that point we were not aware of any similar effort 
under way.
   We notice that this PR has a richer feature set than our library and also 
supports both reading and writing, even for complex metrics (Sketches & HLL). 
Our library has been driven mostly by our internal needs: it supports only 
writing and, at the moment, only basic metric aggregations.
   
   
   Regarding writing, we took an approach to partitioning the dataset before 
writing that is similar to the one supported in this PR.
   We also wrote a [Scala 
extension](https://github.com/rovio/rovio-ingest/blob/main/src/main/scala/com/rovio/ingest/extensions/DruidDatasetExtensions.scala)
 and a [Python 
wrapper](https://github.com/rovio/rovio-ingest/blob/main/python/rovio_ingest/extensions/dataframe_extension.py)
 to abstract the partitioning logic.
   Unlike this PR, our library expects the data to be partitioned by the 
`__time` column and does not do any segment rationalizing in the commit phase.
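
To illustrate what "partitioned by the `__time` column" means here, below is a minimal sketch in plain Python. It is not the rovio-ingest API or the code from this PR: the function names, the sample rows, and the choice of day-level buckets are all assumptions made for illustration. The idea is only that rows are grouped by a truncated timestamp, so that each group could later be written out as the data for one time interval.

```python
from collections import defaultdict
from datetime import datetime, timezone


def truncate_to_day(ts):
    """Floor a timestamp to the start of its day (assumed DAY granularity)."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def partition_by_time(rows, time_column="__time"):
    """Group rows into day buckets keyed by the truncated time column.

    Hypothetical helper: a stand-in for the partitioning step that a
    writer would expect to have happened before it receives the data.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[truncate_to_day(row[time_column])].append(row)
    # Order buckets chronologically, and rows within each bucket by time.
    return {
        bucket: sorted(bucket_rows, key=lambda r: r[time_column])
        for bucket, bucket_rows in sorted(buckets.items())
    }


# Made-up sample data, not taken from any real dataset.
rows = [
    {"__time": datetime(2021, 5, 4, 9, 30, tzinfo=timezone.utc), "country": "FI", "clicks": 3},
    {"__time": datetime(2021, 5, 3, 23, 59, tzinfo=timezone.utc), "country": "SE", "clicks": 1},
    {"__time": datetime(2021, 5, 4, 0, 5, tzinfo=timezone.utc), "country": "FI", "clicks": 7},
]

partitions = partition_by_time(rows)
for bucket, bucket_rows in partitions.items():
    print(bucket.date(), len(bucket_rows))
```

In a real Spark job the same grouping would be expressed with DataFrame operations rather than Python dictionaries; the sketch only shows the shape of the data the writer expects.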
   
   If you would like to try our library, the modules are available on Maven 
Central and PyPI (as documented in the README).
   
   We are excited that there is an **official** library for reading and writing 
Druid segments with Spark dataframes.
   
   Regarding the feature gap between your extension and rovio-ingest, how do 
you envision the future? Would you be open to, for example, adding a PySpark 
wrapper?
   
   By the way, rovio-ingest has already been added to the list of Community 
and Third Party Software on the Druid website:
   https://github.com/apache/druid-website-src/pull/231
   
   Thank you.

