JD,

What you want to do is very reasonable. Druid could benefit from a distributed 
execution engine that can do shuffles, parallel sorts, and parallel UDFs. 
Spark, Hive, Drill, and Presto are examples of such engines.

Phoenix has a similar requirement, and “Drillix” (a combination of Phoenix with 
Drill) was proposed[1].

As has been discussed on this list previously, the Spark adapter is not 
production quality. Contributions welcome.
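
For anyone who wants to experiment, here is a minimal sketch of how the 
adapter is enabled today (untested; the model file and table names are 
hypothetical, and the calcite-spark module must be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Properties;

    public class SparkAdapterSketch {
      public static void main(String[] args) throws Exception {
        Properties info = new Properties();
        // Hypothetical model file describing the Druid data sources.
        info.setProperty("model", "/path/to/druid-model.json");
        // Ask Calcite to implement operators it cannot push down (such as
        // the join) on Spark rather than the single-JVM Enumerable convention.
        info.setProperty("spark", "true");
        try (Connection c = DriverManager.getConnection("jdbc:calcite:", info);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("select w.\"user\", count(*)\n"
                 + "from \"wiki\" w join \"edits\" e on w.\"user\" = e.\"user\"\n"
                 + "group by w.\"user\"")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + ": " + rs.getLong(2));
          }
        }
      }
    }

With spark=false (the default), the same query compiles to the single-JVM 
Enumerable convention instead.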

Julian

[1] http://phoenix.apache.org/presentations/Drillix.pdf

> On Jun 20, 2017, at 4:34 PM, JD Zheng <[email protected]> wrote:
> 
> Hi, Vladimir,
> 
> 
>> On Jun 20, 2017, at 2:29 PM, Vladimir Sitnikov <[email protected]> 
>> wrote:
>> 
>> JD>we still face the join problem though
>> 
>> Can you please clarify what is the typical dataset you are trying to join
>> (in the number of rows/bytes)?
> 
> Our dataset size varies and can go up to 100+ GB. Since Druid does not 
> support joins, we definitely cannot push the join down to Druid. If we do 
> the join in Calcite, the query is limited by a single host’s memory. That’s 
> why I am thinking of using the Spark engine to address this concern.
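
That bound is inherent to an in-process join: one side has to be fully 
materialized in the local heap before the other side can probe it. A rough 
illustration of the pattern (hypothetical code, not Calcite’s actual 
implementation):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class HashJoinSketch {
      // Joins two row sets on one key column. The entire build side
      // ("left") is held in a single JVM's heap -- the single-host bound.
      static List<String[]> hashJoin(List<String[]> left, int leftKey,
                                     List<String[]> right, int rightKey) {
        Map<String, List<String[]>> build = new HashMap<>();
        for (String[] row : left) {                 // materialize build side
          build.computeIfAbsent(row[leftKey], k -> new ArrayList<>()).add(row);
        }
        List<String[]> out = new ArrayList<>();
        for (String[] probe : right) {              // probe side streams through
          List<String[]> matches = build.getOrDefault(probe[rightKey],
              Collections.<String[]>emptyList());
          for (String[] match : matches) {
            String[] joined = new String[match.length + probe.length];
            System.arraycopy(match, 0, joined, 0, match.length);
            System.arraycopy(probe, 0, joined, match.length, probe.length);
            out.add(joined);
          }
        }
        return out;
      }
    }

At 100+ GB the build side simply cannot fit; a distributed engine spreads it 
across many hosts instead.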
> 
>> Am I right that you struggle with "fetch everything from Druid and join
>> via Enumerable" but are completely fine with "fetch everything from
>> Druid and join via Spark"?
> I am trying to see if we could do something similar to what Hive does: 
> directly pull Druid segments and do the rest in Spark.
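
Short of pulling segments directly the way Hive does, a simpler (if slower) 
variant is to read each datasource through Druid’s SQL/Avatica endpoint into 
Spark and do the join there. A sketch, assuming Druid 0.10+ with SQL enabled; 
the broker URL and datasource names are hypothetical:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DruidSparkJoinSketch {
      public static void main(String[] args) {
        SparkSession spark =
            SparkSession.builder().appName("druid-join").getOrCreate();

        // Hypothetical broker; needs the Avatica client jar on the classpath.
        String url =
            "jdbc:avatica:remote:url=http://broker:8082/druid/v2/sql/avatica/";

        Dataset<Row> wiki = spark.read().format("jdbc")
            .option("url", url)
            .option("driver", "org.apache.calcite.avatica.remote.Driver")
            .option("dbtable", "(SELECT \"user\", \"channel\" FROM \"wiki\") w")
            .load();

        Dataset<Row> edits = spark.read().format("jdbc")
            .option("url", url)
            .option("driver", "org.apache.calcite.avatica.remote.Driver")
            .option("dbtable", "(SELECT \"user\", \"delta\" FROM \"edits\") e")
            .load();

        // The join runs as a distributed shuffle across executors, so no
        // single host has to hold an entire side in memory.
        wiki.join(edits, "user").show();
      }
    }

The fetch still goes through the broker, so it does not get the 
direct-segment-read speedup, but the join itself is fully distributed.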
> 
>> I'm not sure Spark itself would make things way faster.
>> 
> At least we won’t be bound by a single host’s memory.
> 
>> Could you share some queries along with dataset sizes and expected/actual
>> execution plans?
>> 
> 
>> Vladimir
> 
