Re: Calcite vs Catalyst

Jordan Halterman Wed, 31 May 2017 15:44:45 -0700

I left DataScience a few months ago to go lead distributed systems work at the 
Open Networking Laboratory. Originally, the week I left they said they intended 
to open source it, but that never happened. I no longer have control over the 
project, but I’m still in touch with my old partner in crime, Jason Slepicka. 
DataScience has been hesitant to open source it for financial reasons that I 
disagree with, but Jason recently mentioned they’re warming to the idea again. 
Hopefully that will happen soon. Sitting on it for so long isn’t helping, but 
it’s out of my hands for now.


Jordan
> On May 31, 2017, at 10:44 AM, Khai Tran <[email protected]> wrote:
> 
> Hi Jordan,
> Just want to check if you guys have any plan to contribute back the work of 
> converting back and forth between Calcite and Spark/Catalyst plans?
> 
> Thanks,
> Khai
> 
> On Thu, Feb 16, 2017 at 3:42 PM, [email protected] 
> <mailto:[email protected]> <[email protected] 
> <mailto:[email protected]>> wrote:
> Calcite differs from Catalyst in many ways. First of all, Catalyst is 
> essentially a heuristic optimizer, while Calcite optimizers often combine 
> heuristics and cost-based optimization. Catalyst pushes down predicates and 
> projections to most data sources, while Calcite can often push down full 
> queries. It's certainly also capable of pushing down filters for struct 
> fields. Some of these types of features like SPARK-19609 may have to be 
> implemented as custom rules. But we've successfully replaced Spark's Catalyst 
> optimizer with Calcite and have recorded up to two orders of magnitude 
> improvements in performance running TPC-DS queries against many databases.
> 
> Whether there's value in using Calcite in Spark depends on your use case. 
> Drill and other systems are certainly sufficient to take better advantage of 
> the features of underlying databases. It's not easy to build the conversions 
> between Catalyst plans and Calcite plans - it took us months - but doing so 
> allowed us to continue using Spark's popular programmatic APIs while 
> significantly improving its performance when querying relational databases, 
> Mongo, etc.
> 
> > On Feb 16, 2017, at 3:28 PM, Nick Dimiduk <[email protected] 
> > <mailto:[email protected]>> wrote:
> >
> > Heya,
> >
> > I've been using Spark recently and have stumbled across a couple surprising
> > bugs/feature gaps. It got me curious about how Calcite would handle the
> > same scenarios. Basically, I'm wondering if Calcite would handle these
> > scenarios directly or if it would defer to the underlying runtime. I.E.,
> > would I be better off for this task with Calcite via Hive or Drill vs.
> > Catalyst via Spark.
> >
> > Here are the tickets for reference.
> >
> > SPARK-19615 Provide Dataset union convenience for divergent schema
> > SPARK-19609 Broadcast joins should pushdown join constraints as Filter to
> > the larger relation
> > SPARK-19638 Filter pushdown not working for struct fields
> >
> > Thanks in advance!
> > Nick
>

Re: Calcite vs Catalyst

Reply via email to