Hi, Vladimir,

> On Jun 20, 2017, at 2:29 PM, Vladimir Sitnikov <[email protected]> wrote:
>
> JD> we still face the join problem though
>
> Can you please clarify what is the typical dataset you are trying to join
> (in the number of rows/bytes)?
Our dataset size varies and can go up to 100+ GB. Since Druid does not support joins, we definitely cannot push the join down. If we do the join in Calcite, the query is limited by a single host's memory. That's why I am thinking of using the Spark engine to address this concern.

> Am I right you somehow struggle with "fetch everything from Druid and join
> via Enumerable" and you are completely fine with "fetch everything from
> Druid and join via Spark"?

I am trying to see if we could do something similar to what Hive does: pull Druid segments directly and do the rest in Spark.

> I'm not sure Spark itself would make things way faster.

At least it won't have the single-host memory bound issue.

> Could you share some queries along with dataset sizes and expected/actual
> execution plans?
>
> Vladimir
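For context on the memory argument above, here is a toy sketch (plain Python, purely illustrative, not the Calcite or Spark API) of a shuffle-partitioned hash join, the general strategy a distributed engine like Spark uses: rows are routed to partitions by a hash of the join key, so each worker only needs memory for its own partition rather than for the entire 100+ GB dataset, which is exactly what a single-host Enumerable join cannot avoid.

```python
def shuffle_hash_join(left, right, num_partitions=4):
    """Join two lists of (key, value) pairs partition by partition.

    Toy model of a distributed shuffle join: the "shuffle" phase
    routes each row to a partition by hashing its key, then each
    partition is joined independently (on a real cluster, by a
    different worker), so peak memory per worker is bounded by the
    largest partition, not by the full dataset.
    """
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]

    # Shuffle phase: co-partition both inputs on the join key.
    for k, v in left:
        left_parts[hash(k) % num_partitions].append((k, v))
    for k, v in right:
        right_parts[hash(k) % num_partitions].append((k, v))

    # Join phase: build a hash table per partition and probe it.
    out = []
    for lp, rp in zip(left_parts, right_parts):
        table = {}
        for k, v in lp:
            table.setdefault(k, []).append(v)
        for k, w in rp:
            for v in table.get(k, []):
                out.append((k, v, w))
    return out
```

Example: joining `[(1, "a"), (2, "b"), (3, "c")]` with `[(2, "x"), (3, "y")]` on the key yields the matching pairs for keys 2 and 3, regardless of which partitions the keys hash into.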
