[
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088334#comment-15088334
]
Yan commented on SPARK-12449:
-----------------------------
Stephan, thanks for your explanations and questions. My answers are as follows:
1) This is actually one point of having a "physical plan pruning interface" as
part of the DataSource interface. From just a logical plan, it would probably
be hard to take advantage of the data distribution info that Spark SQL is
actually capable of using. Another advantage of a pluggable physical plan
pruner is the flexibility to make use of data sources' various capabilities,
including partial aggregation, some types of predicate/expression evaluation,
etc. We felt the pain of lacking such a "physical plan pruner" while
developing the Astro project
(http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase), which
forced us to use a separate SQLContext to incorporate many advanced planning
optimizations for HBase, as sketched below.
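To make that workaround concrete, here is roughly what an Astro-style setup
looks like. The strategy name and its (empty) body are hypothetical; the
"experimental.extraStrategies" hook on SQLContext is the actual extension
point Spark exposes, though:
{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical placeholder: a real strategy would match plan shapes that
// HBase can answer and emit corresponding physical operators.
object HBasePushdownStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

// A dedicated context whose planner also tries the extra strategy.
def hbaseSqlContext(sc: SparkContext): SQLContext = {
  val ctx = new SQLContext(sc)
  ctx.experimental.extraStrategies = Seq(HBasePushdownStrategy)
  ctx
}
{code}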
In fact, the current datasource API already supports predicate pruning at the
*physical* plan level, albeit in a limited way, via the "unhandledFilters"
method.
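A minimal sketch of that mechanism (the relation and schema are made up for
illustration): the relation declares which filters it cannot evaluate itself,
and Spark re-applies exactly those on top of the scan:
{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

class ExampleRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("key", StringType),
    StructField("value", IntegerType)))

  // Everything except simple equality is declared unhandled, so Spark's
  // physical plan keeps a Filter operator for those predicates only.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo])

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // A real implementation would push the EqualTo filters into the
    // source scan; this sketch just returns an empty RDD.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
{code}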
2) I don't think the current Spark SQL planner has that capability, and "plan
later" serves a different purpose.
3) Right, this just adds a bit more detail to 2). The idea is the same:
physical plan pruning.
The point seems to be: the question of logical plan pruning vs. physical plan
pruning is really a question of which capabilities of a data source are to be
valued here, physical or logical. My take is physical, given Spark's powerful
dataset/dataframe/SQL capabilities. In fact, the
"isMultiplePartitionExecution" field of the proposed "CatalystSource"
interface, if true, signifies the willingness to leave some *physical*
operations, such as shuffling, to the Spark engine. Logical plan pruning might
make more sense for a SQL federation engine, but Spark is much more capable
than a federated engine, I guess.
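For illustration, a rough sketch of what such an interface might look like.
The names "CatalystSource" and "isMultiplePartitionExecution" come from the
proposal; the signatures below are my guesses, not the actual design:
{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

trait CatalystSource {
  // If true, the source returns per-partition (partial) results and leaves
  // cross-partition physical work, e.g. shuffles and final aggregation, to
  // the Spark engine.
  def isMultiplePartitionExecution(relations: Seq[CatalystSource]): Boolean

  // Whether the source can evaluate the given logical (sub)plan itself.
  def supportsLogicalPlan(plan: LogicalPlan): Boolean

  // Execute the pushed-down plan inside the source, returning rows to Spark.
  def logicalPlanToRDD(plan: LogicalPlan): RDD[Row]
}
{code}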
Admittedly, the stability and complexity of such an interface will be a big
issue, as pointed out by Reynold. I'll just keep my eyes open for any
progress/ideas/topics in this area.
> Pushing down arbitrary logical plans to data sources
> ----------------------------------------------------
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows
> pushing down filters and projections, pruning unnecessary fields and rows
> directly in the data source.
> However, data sources such as SQL engines are capable of doing even more
> preprocessing, e.g., evaluating aggregates. This is beneficial because it
> would reduce the amount of data transferred from the source to Spark. The
> existing interfaces do not allow such processing in the source.
> We propose adding a new interface {{CatalystSource}} that allows deferring
> the processing of arbitrary logical plans to the data source. We have
> already presented the details at Spark Summit Europe 2015:
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining the details.