Jordan, super interesting work you've shared. It would be very cool to get this incorporated back into Spark mainline. That would continue to broaden Calcite's reach :)
On Fri, Jan 20, 2017 at 1:36 PM, [email protected] <[email protected]> wrote:

> So, AFAIK the Spark adapter that's inside Calcite is in an unusable state right now. It still uses Spark 1.x, and the last time I tried it I couldn't get it to run. It probably needs to be either removed or completely rewritten. But I can certainly offer some guidance on working with Spark and Calcite.
>
> As we were discussing on the other thread, I've been doing research on optimizing Spark queries with Calcite at my company. It may or may not be open-sourced some time in the near future; I don't know yet.
>
> There are really a couple of ways to go about optimizing Spark queries using Calcite. The first option is the approach the current code in Calcite takes: use Calcite on RDDs. The code that you see in Calcite seems likely to have been developed before Spark SQL existed, or at least as an alternative to Spark SQL. It allows you to run Calcite SQL queries on Spark by converting optimized Calcite plans into Spark RDD operations, using RDD methods for relational expressions and Calcite's Enumerables for row expressions.
>
> Alternatively, what we wanted to do when we started our project was integrate Calcite directly into Spark SQL. Spark SQL/DataFrames/Datasets are widely used APIs, and we wanted to see if we could apply Calcite's significantly better optimization techniques to Spark's plans without breaking the API. So that's the second way to go about it. What we did was essentially implement a custom Optimizer (a Spark interface) that converts Spark logical plans to Calcite logical plans, uses Calcite to optimize the plan, and then converts from Calcite back to Spark. Essentially, this is a complete replacement of the optimization phase of Catalyst (Spark's optimizer).
> But converting from Spark plans to Calcite plans and back is admittedly a major challenge that has taken months to perfect for more complex expressions like aggregations/grouping sets.
>
> So, the two options are really: replace Spark SQL with Calcite, or integrate Calcite into Spark SQL. The former is a fairly straightforward use case for Calcite. The latter requires a deep understanding of both Calcite's and Spark's relational algebra, and writing algorithms to convert between the two. But I can say that it has been very successful. We've been able to improve Spark's performance quite significantly on all different types of data - including flat files - and have seen 1-2 orders of magnitude improvements in Spark's performance against databases like Postgres, Redshift, Mongo, etc. in TPC-DS benchmarks.
>
> > On Jan 18, 2017, at 12:25 PM, Riccardo Tommasini <[email protected]> wrote:
> >
> > Hello,
> > I'm trying to understand how to use the Spark adapter.
> >
> > Does anyone have an example?
> >
> > Thanks in advance
> >
> > Riccardo Tommasini
> > Master Degree Computer Science
> > PhD Student at Politecnico di Milano (Italy)
> > streamreasoning.org<http://streamreasoning.org/>
> >
> > Submitted from an iPhone, I apologise for typos.
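For reference, the Calcite side of the first option described above (Calcite parses, validates, and plans the SQL itself; an execution layer would then map the resulting RelNode tree to RDD operations) can be sketched with Calcite's Frameworks API. This is only a sketch, not the old adapter's actual code: the schema is left empty and the `emp` table in the query is a hypothetical placeholder.

```scala
import org.apache.calcite.rel.RelNode
import org.apache.calcite.schema.SchemaPlus
import org.apache.calcite.tools.{Frameworks, Planner}

// Build a planner over a (placeholder) root schema; a real integration
// would register table definitions on rootSchema first.
val rootSchema: SchemaPlus = Frameworks.createRootSchema(true)
val config = Frameworks.newConfigBuilder()
  .defaultSchema(rootSchema)
  .build()
val planner: Planner = Frameworks.getPlanner(config)

// Parse -> validate -> convert to relational algebra. "emp" is a
// hypothetical table that would have to exist in rootSchema.
val sqlNode           = planner.parse("select empid from emp where deptno = 10")
val validated         = planner.validate(sqlNode)
val rel: RelNode      = planner.rel(validated).project()

// The Spark adapter's job is then to walk a tree like `rel` and emit
// the corresponding RDD operations (filter, map, and so on).
```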
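The second option — round-tripping Spark's logical plan through Calcite — might look roughly like this in Spark 2.x. The three converter objects are hypothetical stand-ins for the translation layer the reply describes (the part that "has taken months to perfect"). Note also that `spark.experimental.extraOptimizations`, shown here, only appends rules after Catalyst's own batches run; a complete replacement of the optimization phase, as described above, would instead mean substituting Catalyst's Optimizer itself.

```scala
import org.apache.calcite.rel.RelNode
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical converters standing in for the Spark<->Calcite
// translation layer -- the genuinely hard part.
object SparkToCalcite  { def convert(plan: LogicalPlan): RelNode = ??? }
object CalciteToSpark  { def convert(rel: RelNode): LogicalPlan = ??? }
object CalciteOptimize { def run(rel: RelNode): RelNode = ??? }

// A Catalyst rule that sends the whole plan through Calcite and back.
object CalciteOptimizationRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    CalciteToSpark.convert(CalciteOptimize.run(SparkToCalcite.convert(plan)))
}

val spark = SparkSession.builder().master("local").getOrCreate()
// Spark 2.x extension point for extra optimizer rules.
spark.experimental.extraOptimizations = Seq(CalciteOptimizationRule)
```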
