[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080858#comment-15080858 ]

Stephan Kessler commented on SPARK-12449:
-----------------------------------------

Thank you for your thoughts. Unfortunately, I did not get all of them; could 
you help me out?

1) The partitioning in the case we are currently handling comes from the 
data source itself. Rule of thumb: if multiple instances of the data source 
exist, it is "partitioned". We do not make any assumptions about the data 
distribution. Obviously, doing so might be beneficial for further operations 
(e.g., joins), but currently we do not. The assumption behind this is that 
the data source and the Spark execution are loosely coupled, and Spark does 
not control the loading or distribution of the data.

2) This is exactly what the physical planner strategy does. It "probes" 
whether a data source is capable of processing a fraction of the logical 
plan; if it is not, the strategy does not return any physical operator 
(i.e., no {{org.apache.spark.sql.execution.SparkPlan}}). The other planning 
strategies come first and may leave fractions of the plan to be "planned 
later"; on those left-out parts, the data source strategy is called again.
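
Roughly, such a probing strategy could look like the sketch below; the 
interface and operator names ({{CatalystSource}}, {{supportsLogicalPlan}}, 
{{logicalPlanToRDD}}, {{CatalystSourceScan}}) and their signatures are 
simplified for illustration and are not the final design:

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

// The proposed interface, sketched: a relation that can evaluate a whole
// logical plan on its own. Signatures are simplified, not final.
trait CatalystSource {
  def supportsLogicalPlan(plan: LogicalPlan): Boolean
  def logicalPlanToRDD(plan: LogicalPlan): RDD[InternalRow]
}

// Placeholder physical operator that wraps the source's result.
case class CatalystSourceScan(logicalPlan: LogicalPlan, source: CatalystSource)
    extends SparkPlan {
  override def output: Seq[Attribute] = logicalPlan.output
  override def children: Seq[SparkPlan] = Nil
  protected override def doExecute(): RDD[InternalRow] =
    source.logicalPlanToRDD(logicalPlan)
}

// The probing strategy: claim the (sub)plan only if it refers to exactly
// one CatalystSource relation that declares support for it. Returning Nil
// declines, so the other strategies plan that fraction, and this strategy
// is invoked again on the parts they leave unplanned.
object CatalystSourceStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = {
    val sources = plan
      .collect { case r: LogicalRelation => r.relation }
      .collect { case cs: CatalystSource => cs }
    sources match {
      case Seq(cs) if cs.supportsLogicalPlan(plan) =>
        CatalystSourceScan(plan, cs) :: Nil
      case _ => Nil
    }
  }
}
{code}

Because declining is just returning {{Nil}}, the planner falls back to the 
built-in strategies transparently, with no special error handling.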

3) If I understand it correctly, this is a consequence of 2)? I guess I 
should add a more complex example to the design document to make things 
clearer at that stage.

> Pushing down arbitrary logical plans to data sources
> ----------------------------------------------------
>
>                 Key: SPARK-12449
>                 URL: https://issues.apache.org/jira/browse/SPARK-12449
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Stephan Kessler
>         Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> pushing down filters and projections, pruning unnecessary rows and fields 
> directly in the data source.
> However, data sources such as SQL engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow this kind of processing in the source.
> We propose adding a new interface {{CatalystSource}} that allows deferring 
> the processing of arbitrary logical plans to the data source. We already 
> presented the details at Spark Summit Europe 2015: 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining the details. 
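
For contrast with the quoted description, a minimal relation built on the 
existing {{PrunedFilteredScan}} interface from {{org.apache.spark.sql.sources}} 
might look as follows; the toy schema and in-memory data are purely 
illustrative:

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, GreaterThan, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Toy relation over in-memory data.
case class PeopleRelation(sqlContext: SQLContext) extends BaseRelation
    with PrunedFilteredScan {

  private val data = Seq(Row("alice", 29), Row("bob", 41))

  override val schema: StructType = StructType(Seq(
    StructField("name", StringType),
    StructField("age", IntegerType)))

  // Spark hands over only the required columns and the filters it can
  // express; rows and fields are pruned here, in the "source".
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val kept = data.filter { row =>
      filters.forall {
        case GreaterThan("age", v: Int) => row.getInt(1) > v
        case _ => true // filters we do not handle are re-evaluated by Spark
      }
    }
    val indices = requiredColumns.map(schema.fieldNames.indexOf)
    sqlContext.sparkContext.parallelize(
      kept.map(row => Row.fromSeq(indices.map(row.get))))
  }
}
{code}

Column pruning and simple filters are all that fits through this hook; 
aggregates, joins, and other expressions still run in Spark after the 
transfer, which is exactly the limitation {{CatalystSource}} is meant to 
lift.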


