[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Kessler updated SPARK-12449:
------------------------------------
    Description: 
The DataSource API lets Spark pull data from external sources for processing. 
Implementing interfaces such as {{PrunedFilteredScan}} lets a source accept 
pushed-down filters and projections, so that unnecessary rows and fields are 
pruned directly in the data source.
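
To make the idea of filter and projection pushdown concrete, here is a minimal 
sketch in plain Python. It is a toy model, not Spark's API: {{ToySource}}, 
{{GreaterThan}} and {{build_scan}} are hypothetical names chosen for 
illustration, loosely mirroring the shape of a scan that receives the required 
columns and a list of filters.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a pushed-down filter; not a Spark class.
@dataclass(frozen=True)
class GreaterThan:
    attribute: str
    value: float


class ToySource:
    """Toy external source that prunes rows and fields before returning data."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per record

    def build_scan(self, required_columns, filters):
        # Apply every filter inside the source, then keep only the
        # requested columns, so less data leaves the source.
        result = []
        for row in self.rows:
            if all(row[f.attribute] > f.value for f in filters):
                result.append({c: row[c] for c in required_columns})
        return result


rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 250.0},
    {"id": 3, "region": "EU", "amount": 300.0},
]
source = ToySource(rows)
result = source.build_scan(["id", "amount"], [GreaterThan("amount", 100.0)])
print(result)
```

Only two of the three rows, and two of the three fields, are returned to the 
caller; the rest never leaves the source.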

However, data sources such as SQL engines can do even more preprocessing, 
e.g., evaluating aggregates. This is beneficial because it reduces the amount 
of data transferred from the source to Spark. The existing interfaces do not 
allow this kind of processing in the source.
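
The effect on data transfer can be illustrated with a small sketch that uses 
Python's built-in {{sqlite3}} module as a stand-in for an external SQL engine. 
The table name, columns and values are made up for illustration; the point is 
only to compare how many rows cross the boundary in each case.

```python
import sqlite3

# An in-memory SQLite database plays the role of the external SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 10.0), ("EU", 300.0), ("US", 250.0), ("US", 50.0), ("US", 5.0)],
)

# Without aggregate pushdown: every row is fetched and the caller aggregates.
raw = conn.execute("SELECT region, amount FROM sales").fetchall()
totals_local = {}
for region, amount in raw:
    totals_local[region] = totals_local.get(region, 0.0) + amount

# With aggregate pushdown: the source evaluates SUM and returns one row
# per group.
pushed = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
totals_pushed = dict(pushed)

print("rows transferred without pushdown:", len(raw))
print("rows transferred with pushdown:", len(pushed))
```

Both variants compute the same totals, but the pushed-down query returns one 
row per region instead of one row per sale.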

We propose adding a new interface, {{CatalystSource}}, that allows deferring 
the processing of arbitrary logical plans to the data source. We presented the 
details at Spark Summit Europe 2015: 
[https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
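
As a rough illustration of what deferring a whole plan could look like, the 
following plain-Python sketch models a tiny plan tree and a source that either 
accepts the plan and translates it to SQL, or declines it. This is not the 
proposed {{CatalystSource}} interface, whose details are left to the design 
document; the node classes and the {{supports}} and {{to_sql}} methods are 
assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical, heavily simplified plan nodes; not Catalyst classes.
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    predicate: str  # SQL predicate text


@dataclass(frozen=True)
class Aggregate:
    child: object
    group_by: Tuple[str, ...]
    aggregates: Tuple[str, ...]  # SQL aggregate expressions


@dataclass(frozen=True)
class Sort:
    child: object
    order_by: str


class ToySqlSource:
    """Toy source that accepts plans built from Scan, Filter and Aggregate."""

    def supports(self, plan) -> bool:
        if isinstance(plan, Scan):
            return True
        if isinstance(plan, (Filter, Aggregate)):
            return self.supports(plan.child)
        return False  # e.g. Sort: the caller would have to process it itself

    def to_sql(self, plan) -> str:
        if isinstance(plan, Scan):
            return f"SELECT * FROM {plan.table}"
        if isinstance(plan, Filter):
            inner = self.to_sql(plan.child)
            return f"SELECT * FROM ({inner}) AS t WHERE {plan.predicate}"
        if isinstance(plan, Aggregate):
            inner = self.to_sql(plan.child)
            cols = ", ".join(plan.group_by + plan.aggregates)
            sql = f"SELECT {cols} FROM ({inner}) AS t"
            if plan.group_by:
                sql += " GROUP BY " + ", ".join(plan.group_by)
            return sql
        raise ValueError(f"unsupported plan node: {plan!r}")


source = ToySqlSource()
plan = Aggregate(
    Filter(Scan("sales"), "amount > 20"),
    group_by=("region",),
    aggregates=("SUM(amount) AS total",),
)
unsupported = Sort(plan, "total")

sql = source.to_sql(plan) if source.supports(plan) else None
print(sql)
print(source.supports(unsupported))
```

In this sketch the source decides per plan whether it can take over the work; 
a plan it declines, such as the {{Sort}} above, would stay with the caller.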

I will add a design document explaining the details.

  was:
With the help of the DataSource API we can pull data from external sources for 
processing. Implementing interfaces such as {{PrunedFilteredScan}} allows to 
push down filters and projects pruning unnecessary fields and rows directly in 
the data source.

However, data sources such as SQL Engines are capable of doing even more 
preprocessing, e.g., evaluating aggregates. This is beneficial because it would 
reduce the amount of data transferred from the source to Spark. The existing 
interfaces do not allow such kind of processing in the source.

We would propose to add a new interface ({{CatalystSource}} that allows to 
defer the processing of arbitrary logical plans to the data source. We have 
already shown the details at the Spark Summit 2015 Europe 
[https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]

I will add a design document explaining details. 


> Pushing down arbitrary logical plans to data sources
> ----------------------------------------------------
>
>                 Key: SPARK-12449
>                 URL: https://issues.apache.org/jira/browse/SPARK-12449
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Stephan Kessler


