Hi JiaTao,

That sounds interesting. A few questions:
1- When you go from RelNode to "Spark DataFrame plan", which of the following do you mean?
   * Spark SQL
   * Spark DataFrame Scala code
   * An in-memory Spark Catalyst plan
   * A human-readable string representation of the Spark plan (e.g., similar to DataFrame.explain)
   * Some serialization of the in-memory Spark plan (similar to the human-readable form, but more ser/de-friendly without necessarily being human-readable)

2- In the Presto and Spark conversions you mention below, you state that you start from a RelNode. Could you clarify where the RelNode originally comes from? What is the use case in each?

3- It would be awesome if you could contribute to coral-spark-plan [1]. Its current objective is to convert a human-readable Spark plan (the output of DataFrame.explain) to a RelNode. Right now it can do basic conversions (see the test cases [2]). This module can help with:
** Analyzing Spark jobs (we have used it to figure out which Spark jobs in our history server push complex predicates down, as complex predicates are not supported on DataSource V2 [3])
** Converting arbitrary Spark logic to other platforms (e.g., Spark Catalyst plans to Presto), since even Scala code ends up being represented in the plan string in a structured way
** Converting Spark Scala code back to SQL

[1] https://github.com/linkedin/coral/tree/master/coral-spark-plan
[2] https://github.com/linkedin/coral/blob/master/coral-spark-plan/src/test/java/com/linkedin/coral/sparkplan/SparkPlanToIRRelConverterTest.java
[3] https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-SparkStrategy-DataSourceStrategy.html

Thanks,
Walaa.

On Sat, Dec 12, 2020 at 5:24 AM JiaTao Tao <[email protected]> wrote:

> Hi Walaa,
>
> Very happy to see this. Our team basically does the same thing, a unified
> SQL layer:
> 1. Spark: RelNode -> Spark DataFrame plan
> 2. Presto: RelNode -> SQL string
> 3. Clickhouse: RelNode -> serialized RelNode
> 4. Flink: TBD (with the DataStream API or the Table API)
>
> I did point 1 both at my previous company and my current company; maybe I
> can participate in this part: analyzing and translating Spark Catalyst
> plans.
>
> Regards!
>
> Aron Tao
>
> Walaa Eldin Moustafa <[email protected]> wrote on Sat, Dec 12, 2020
> at 5:34 AM:
>
> > Hi Calcite community,
> >
> > I wanted to share a recently published LinkedIn blog series article [1]
> > on how Calcite helps us build a smarter data lake using Coral [2]. Hope
> > you find it interesting. Also, if you want to discuss with our team and
> > the data lake + Calcite community, please feel free to join our Coral
> > Slack workspace [3].
> >
> > [1] https://engineering.linkedin.com/blog/2020/coral
> > [2] https://github.com/linkedin/coral
> > [3] https://join.slack.com/t/coral-sql/shared_invite/zt-j9jw5idg-mkt3fjA~wgoUEMXXZqMr6g
> >
> > Thanks,
> > Walaa.
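P.S. To make the predicate-pushdown use case more concrete, here is a minimal sketch of the idea behind analyzing a human-readable Spark physical plan (the output of DataFrame.explain). The function name, the sample plan text, and the parsing approach are all illustrative assumptions of mine, written in Python for brevity; the actual coral-spark-plan converter is in Java and builds a full RelNode rather than just scanning strings.

```python
import re

def pushed_filters(plan_text):
    """Return the predicates a Spark plan string reports as pushed to the scan.

    Looks for the "PushedFilters: [...]" fragment that Spark prints on
    FileScan lines in DataFrame.explain output. Note: the simple comma
    split below would break on filters that themselves contain commas;
    that is fine for a sketch, not for a real parser.
    """
    m = re.search(r"PushedFilters: \[([^\]]*)\]", plan_text)
    if not m or not m.group(1).strip():
        return []
    return [f.strip() for f in m.group(1).split(",")]

# Illustrative (hypothetical) explain output, not taken from a real job.
plan = (
    "*(1) Project [name#0, age#1]\n"
    "+- *(1) Filter (isnotnull(age#1) AND isnotnull(name#0))\n"
    "   +- FileScan parquet default.people[name#0,age#1] "
    "PushedFilters: [IsNotNull(age), IsNotNull(name)], "
    "ReadSchema: struct<name:string,age:int>"
)
print(pushed_filters(plan))  # -> ['IsNotNull(age)', 'IsNotNull(name)']
```

A history-server analysis like the one mentioned in point 3 can then compare the filters in the Filter operator against the pushed list to flag jobs whose complex predicates were not pushed down.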
