[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068447#comment-15068447 ]

Yan commented on SPARK-12449:
-----------------------------

A few thoughts on the capabilities of this "CatalystSource Interface":

1) Provide data source partition information given a filtering predicate. 
Holistic/partitioned execution could also be (partially) controlled by this 
output, making partition pruning pluggable;
2) Have an interface that transforms a portion of a physical plan into a 
"pushed-down" plan plus a "left-over" plan to be executed inside Spark. Spark 
planning may need to carve out that portion from the original plan so that it 
touches only a single data source, leaving the execution over data spanning 
different data sources to Spark. This puts the decision of which portion of 
the plan can be pushed down in the hands of the data sources; in particular, 
pushdown of either a whole SQL statement or just the map-side work could be 
supported;
3) When carving out that portion, Spark planning can start from the SCAN and 
move downstream, and may (optionally) stop a branch at any intermediate 
DataFrame that is to be cached or persisted, so as to honor Spark's existing 
execution at no extra cost.
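The two capabilities above could be sketched roughly as follows. This is a 
hypothetical illustration only: the trait name `CatalystSource` comes from the 
issue, but every signature and the stand-in plan/expression types below are 
made up for the example and are not Spark's real Catalyst classes.

```scala
// Simplified stand-ins for a predicate and a logical plan node
// (NOT Spark's real Expression/LogicalPlan classes).
sealed trait Expression
case class EqualTo(column: String, value: Any) extends Expression

sealed trait LogicalPlan
case class Scan(table: String) extends LogicalPlan
case class Filter(condition: Expression, child: LogicalPlan) extends LogicalPlan
case class Aggregate(groupBy: Seq[String], child: LogicalPlan) extends LogicalPlan

// Hypothetical interface; method names and signatures are illustrative only.
trait CatalystSource {
  // Capability 1: partition info for a predicate, making pruning pluggable.
  def partitionsFor(predicate: Expression): Seq[Int]

  // Capability 2: split a plan into a pushed-down part and a left-over part
  // that Spark itself executes. None means "push nothing down".
  def pushDown(plan: LogicalPlan): Option[(LogicalPlan, LogicalPlan)]
}

// Toy implementation: a source with 4 partitions keyed by "year".
object ToySource extends CatalystSource {
  def partitionsFor(predicate: Expression): Seq[Int] = predicate match {
    case EqualTo("year", y: Int) => Seq(y % 4) // prune to a single partition
    case _                       => 0 until 4  // no pruning possible
  }
  def pushDown(plan: LogicalPlan): Option[(LogicalPlan, LogicalPlan)] = plan match {
    // Push filters over our own scan; leave anything else to Spark.
    case f @ Filter(_, Scan(t)) => Some((f, Scan(s"result_of_$t")))
    case _                      => None
  }
}
```

Returning an `Option` lets the source decline a pushdown entirely, which 
matches the idea that the source, not Spark, decides what it can execute.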

> Pushing down arbitrary logical plans to data sources
> ----------------------------------------------------
>
>                 Key: SPARK-12449
>                 URL: https://issues.apache.org/jira/browse/SPARK-12449
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Stephan Kessler
>         Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> pushing down filters and projections, pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL engines are capable of even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> reduces the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow this kind of processing in the source.
> We propose to add a new interface {{CatalystSource}} that allows deferring 
> the processing of arbitrary logical plans to the data source. We have 
> already presented the details at Spark Summit 2015 Europe: 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining the details. 
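The benefit described in the quoted issue can be illustrated with plain Scala 
collections standing in for a remote source; the data and names below are made 
up for the example.

```scala
// Suppose the remote source holds one row per event: (country, amount).
val remoteRows: Seq[(String, Int)] =
  Seq(("DE", 10), ("DE", 20), ("FR", 5), ("FR", 7), ("FR", 1))

// Filter/projection pushdown alone: every matching row crosses the wire,
// and Spark computes the aggregate itself.
val transferredWithoutAggPushdown = remoteRows.size // 5 rows

// Aggregate pushdown: the source evaluates GROUP BY country, SUM(amount)
// and ships only one row per group.
val pushedDownResult: Map[String, Int] =
  remoteRows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
val transferredWithAggPushdown = pushedDownResult.size // 2 rows
```

The savings grow with the number of rows per group, which is exactly the case 
where pushing the aggregate to a capable SQL engine pays off.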



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
