Jordan, super interesting work you've shared. It would be very cool to get this incorporated back into Spark mainline. That would continue to broaden Calcite's reach :)
On Fri, Jan 20, 2017 at 1:36 PM, [email protected] <[email protected]> wrote:

> So, AFAIK the Spark adapter that's inside Calcite is in an unusable state right now. It still uses Spark 1.x, and the last time I tried it I couldn't get it to run. It probably needs to be either removed or completely rewritten. But I can certainly offer some guidance on working with Spark and Calcite.
>
> As we were discussing on the other thread, I've been doing research on optimizing Spark queries with Calcite at my company. It may or may not be open-sourced some time in the near future; I don't know yet.
>
> There are really a couple of ways to go about optimizing Spark queries using Calcite. The first option is the approach the current code in Calcite takes: use Calcite on RDDs. The code that you see in Calcite seems likely to have been developed before Spark SQL existed, or at least as an alternative to Spark SQL. It allows you to run Calcite SQL queries on Spark by converting optimized Calcite plans into Spark RDD operations, using RDD methods for relational expressions and Calcite's Enumerables for row expressions.
>
> Alternatively, what we wanted to do when we started our project was integrate Calcite directly into Spark SQL. Spark SQL/DataFrames/Datasets are widely used APIs, and we wanted to see if we could apply Calcite's significantly better optimization techniques to Spark's plans without breaking the API. So that's the second way to go about it. What we did was essentially implement a custom Optimizer (a Spark interface) that converts Spark logical plans to Calcite logical plans, uses Calcite to optimize the plan, and then converts from Calcite back to Spark. Essentially, this is a complete replacement of the optimization phase of Catalyst (Spark's optimizer).
> But converting from Spark plans to Calcite plans and back is admittedly a major challenge that has taken months to perfect for more complex expressions like aggregations/grouping sets.
>
> So, the two options are really: replace Spark SQL with Calcite, or integrate Calcite into Spark SQL. The former is a fairly straightforward use case for Calcite. The latter requires a deep understanding of both Calcite's and Spark's relational algebra, and writing algorithms to convert between the two. But I can say that it has been very successful. We've been able to improve Spark's performance quite significantly on all different types of data - including flat files - and have seen 1-2 orders of magnitude improvements in Spark's performance against databases like Postgres, Redshift, Mongo, etc. in TPC-DS benchmarks.
>
> > On Jan 18, 2017, at 12:25 PM, Riccardo Tommasini <[email protected]> wrote:
> >
> > Hello,
> > I'm trying to understand how to use the Spark adapter.
> >
> > Does anyone have an example?
> >
> > Thanks in advance
> >
> > Riccardo Tommasini
> > Master Degree Computer Science
> > PhD Student at Politecnico di Milano (Italy)
> > streamreasoning.org<http://streamreasoning.org/>
> >
> > Submitted from an iPhone, I apologise for typos.
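For reference, the Calcite side of the first option described above (Calcite parses, validates, and plans the SQL itself; an execution layer would then map the resulting RelNode tree to RDD operations) can be sketched with Calcite's Frameworks API. This is only a sketch, not the old adapter's actual code: the schema is left empty and the `emp` table in the query is a hypothetical placeholder.

```scala
import org.apache.calcite.rel.RelNode
import org.apache.calcite.schema.SchemaPlus
import org.apache.calcite.tools.{Frameworks, Planner}

// Build a planner over a (placeholder) root schema; a real integration
// would register table definitions on rootSchema first.
val rootSchema: SchemaPlus = Frameworks.createRootSchema(true)
val config = Frameworks.newConfigBuilder()
  .defaultSchema(rootSchema)
  .build()
val planner: Planner = Frameworks.getPlanner(config)

// Parse -> validate -> convert to relational algebra. "emp" is a
// hypothetical table that would have to exist in rootSchema.
val sqlNode           = planner.parse("select empid from emp where deptno = 10")
val validated         = planner.validate(sqlNode)
val rel: RelNode      = planner.rel(validated).project()

// The Spark adapter's job is then to walk a tree like `rel` and emit
// the corresponding RDD operations (filter, map, and so on).
```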
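The second option — round-tripping Spark's logical plan through Calcite — might look roughly like this in Spark 2.x. The three converter objects are hypothetical stand-ins for the translation layer the reply describes (the part that "has taken months to perfect"). Note also that `spark.experimental.extraOptimizations`, shown here, only appends rules after Catalyst's own batches run; a complete replacement of the optimization phase, as described above, would instead mean substituting Catalyst's Optimizer itself.

```scala
import org.apache.calcite.rel.RelNode
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical converters standing in for the Spark<->Calcite
// translation layer -- the genuinely hard part.
object SparkToCalcite  { def convert(plan: LogicalPlan): RelNode = ??? }
object CalciteToSpark  { def convert(rel: RelNode): LogicalPlan = ??? }
object CalciteOptimize { def run(rel: RelNode): RelNode = ??? }

// A Catalyst rule that sends the whole plan through Calcite and back.
object CalciteOptimizationRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    CalciteToSpark.convert(CalciteOptimize.run(SparkToCalcite.convert(plan)))
}

val spark = SparkSession.builder().master("local").getOrCreate()
// Spark 2.x extension point for extra optimizer rules.
spark.experimental.extraOptimizations = Seq(CalciteOptimizationRule)
```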
