JD,

What you want to do is very reasonable. Druid could benefit from a distributed execution engine that can do shuffles, parallel sorts, and parallel UDFs. Spark, Hive, Drill, and Presto are examples of such engines.
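For reference, Calcite ships an experimental Spark adapter: with the calcite-spark module on the classpath, setting spark=true on the connection asks Calcite to execute on Spark whatever cannot be pushed down to the source system. A minimal sketch of how one would enable it; the model file, table, and column names are made up for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Properties;

    public class SparkAdapterSketch {
      public static void main(String[] args) throws Exception {
        Properties info = new Properties();
        // Ask Calcite to run non-pushable operators on Spark
        // (requires the calcite-spark module on the classpath).
        info.setProperty("spark", "true");
        // Hypothetical model file mapping Druid datasources
        // into a Calcite schema.
        info.setProperty("model", "druid-model.json");
        try (Connection conn =
                 DriverManager.getConnection("jdbc:calcite:", info);
             Statement stmt = conn.createStatement();
             // The scans can be pushed to Druid; the join itself
             // would be handled by the Spark engine.
             ResultSet rs = stmt.executeQuery(
                 "select * from \"events\" e join \"users\" u"
                     + " on e.\"user_id\" = u.\"id\"")) {
          while (rs.next()) {
            System.out.println(rs.getString(1));
          }
        }
      }
    }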
Phoenix has a similar requirement, and “Drillix” (a combination of Phoenix with Drill) was proposed [1].

As has been discussed on this list previously, the Spark adapter is not production quality. Contributions welcome.

Julian

[1] http://phoenix.apache.org/presentations/Drillix.pdf

> On Jun 20, 2017, at 4:34 PM, JD Zheng <[email protected]> wrote:
>
> Hi, Vladimir,
>
>> On Jun 20, 2017, at 2:29 PM, Vladimir Sitnikov <[email protected]> wrote:
>>
>> JD> we still face the join problem though
>>
>> Can you please clarify what the typical dataset you are trying to join
>> looks like (in number of rows/bytes)?
>
> Our dataset size varies and can go up to 100+ GB. Since Druid does not
> support joins, we definitely cannot push the join down. If we do the join
> in Calcite, the query is limited by a single host’s memory. That’s why I am
> thinking of using the Spark engine to address this concern.
>
>> Am I right that you somehow struggle with "fetch everything from Druid and
>> join via Enumerable" but are completely fine with "fetch everything from
>> Druid and join via Spark"?
>
> I am trying to see if we could do something similar to what Hive does:
> directly pull Druid segments and do the rest in Spark.
>
>> I'm not sure Spark itself would make things way faster.
>
> At least it won’t have the single-host memory bound issue.
>
>> Could you share some queries along with dataset sizes and expected/actual
>> execution plans?
>>
>> Vladimir
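For what it’s worth, here is a rough sketch of the “fetch from Druid, join in Spark” approach discussed above. Note the assumptions: the broker URL and table names are hypothetical, it requires Druid’s SQL/Avatica JDBC endpoint, and it streams rows through the broker rather than reading segments directly from deep storage (the latter is what Hive’s Druid integration does):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DruidSparkJoinSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("druid-join-sketch")
            .getOrCreate();

        // Hypothetical Druid broker; Druid exposes SQL over JDBC
        // via Avatica.
        String url =
            "jdbc:avatica:remote:url=http://broker:8082/druid/v2/sql/avatica/";

        // Read each Druid datasource as a DataFrame. "events" and
        // "users" are made-up table names; rows stream through the
        // broker, so this is not segment-level access.
        Dataset<Row> events = spark.read().format("jdbc")
            .option("url", url)
            .option("driver", "org.apache.calcite.avatica.remote.Driver")
            .option("dbtable", "events")
            .load();
        Dataset<Row> users = spark.read().format("jdbc")
            .option("url", url)
            .option("driver", "org.apache.calcite.avatica.remote.Driver")
            .option("dbtable", "users")
            .load();

        // The join (shuffle, sort, spill) runs on the Spark cluster,
        // so it is not bounded by a single host's memory.
        events.join(users, events.col("user_id").equalTo(users.col("id")))
            .show();
      }
    }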
