[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093533#comment-15093533 ]

Santiago M. Mola commented on SPARK-12449:
------------------------------------------

Implementing this interface, or an equivalent one, would help standardize a lot of 
the advanced features that data sources have been implementing for some time. It 
would also keep them from having to create their own SQLContext variants or patch 
the running SQLContext at runtime (via extraStrategies).
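
For reference, this is the general shape of that extraStrategies pattern (a minimal 
sketch against the Spark 1.x API; the strategy and connector names are hypothetical):

{code:scala}
import org.apache.spark.sql.{SQLContext, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// The connector's strategy: the planner offers it every logical plan.
object MyPushdownStrategy extends Strategy {
  // Return physical plan candidates if the source can handle this plan,
  // or Nil to fall through to Spark's built-in strategies.
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

object MyConnector {
  // Patch the running SQLContext at runtime: every subsequent query is first
  // offered to MyPushdownStrategy before the default planner strategies.
  def install(sqlContext: SQLContext): Unit = {
    sqlContext.experimental.extraStrategies =
      MyPushdownStrategy +: sqlContext.experimental.extraStrategies
  }
}
{code}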

Here's a list of data sources that are currently taking this approach. It would be 
good to take them into account for this JIRA; the proposed interface and strategy 
should probably support all of these use cases (a rough sketch of what such an 
interface could look like follows the list). Some of them also use their own 
catalog implementation, but that should be left for a separate JIRA.

*spark-sql-on-hbase*

Already mentioned by [~yzhou2001]. They use an HBaseContext with 
extraStrategies that inject HBaseStrategies to push down aggregations:
https://github.com/Huawei-Spark/Spark-SQL-on-HBase/blob/master/src/main/scala/org/apache/spark/sql/hbase/execution/HBaseStrategies.scala

*memsql-spark-connector*

They offer either their own SQLContext or injection of their MemSQL-specific 
push-down strategy at runtime.
They match Catalyst's LogicalPlan in the same way we're proposing, to push 
down filters, projects, aggregates, limits, sorts and joins:
https://github.com/memsql/memsql-spark-connector/blob/master/connectorLib/src/main/scala/com/memsql/spark/pushdown/MemSQLPushdownStrategy.scala
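
Roughly, the matching in such a strategy looks like the sketch below (simplified, 
not the connector's actual code; the cases only show the plan shapes being matched 
and defer back to Spark by returning Nil, where a real implementation would return 
a single push-down physical node):

{code:scala}
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Filter, Limit, LogicalPlan, Sort}
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

object ExamplePushdownStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // An aggregate over a filtered scan of a data-source relation: the whole
    // subtree could be compiled to the source's query language and run there.
    case Aggregate(groupingExprs, aggregateExprs, Filter(condition, scan: LogicalRelation)) =>
      Nil // a real strategy would return its push-down SparkPlan here
    // Likewise for a limit over a sorted scan.
    case Limit(limitExpr, Sort(order, global, scan: LogicalRelation)) =>
      Nil
    case _ => Nil // let Spark plan everything else
  }
}
{code}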

*spark-iqmulus*

Strategy injected to push down counts and some aggregates:

https://github.com/IGNF/spark-iqmulus/blob/master/src/main/scala/fr/ign/spark/iqmulus/ExtraStrategies.scala

*druid-olap*

They use SparkPlanner, Strategy and LogicalPlan APIs to do extensive push down. 
Their API usage could be limited to LogicalPlan only if this JIRA is 
implemented:

https://github.com/SparklineData/spark-druid-olap/blob/master/src/main/scala/org/apache/spark/sql/sources/druid/

*magellan* _(probably out of scope)_

Implements its own BroadcastJoin, although this usage seems to me to be out of 
scope for us.

https://github.com/harsha2010/magellan/blob/master/src/main/scala/magellan/execution/MagellanStrategies.scala
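
For illustration, here is a minimal sketch of what an interface covering the cases 
above could look like (all names are hypothetical; the actual proposal is in the 
attached design document and the linked talk):

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

trait CatalystSourceLike {
  // True if the source can evaluate this (sub)plan, e.g. an Aggregate over
  // a Filter over one of its own relations.
  def supportsLogicalPlan(plan: LogicalPlan): Boolean

  // Execute the accepted plan inside the source and return the result rows;
  // a single planner strategy in Spark could then cover all of the connectors above.
  def logicalPlanToRDD(plan: LogicalPlan): RDD[Row]
}
{code}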

> Pushing down arbitrary logical plans to data sources
> ----------------------------------------------------
>
>                 Key: SPARK-12449
>                 URL: https://issues.apache.org/jira/browse/SPARK-12449
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Stephan Kessler
>         Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> pushing down filters and projects, pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL engines are capable of even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow this kind of processing in the source.
> We propose to add a new interface, {{CatalystSource}}, that allows deferring 
> the processing of arbitrary logical plans to the data source. We have already 
> presented the details at Spark Summit Europe 2015: 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining the details.
