[
https://issues.apache.org/jira/browse/SPARK-16689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-16689.
-------------------------------
Resolution: Won't Fix
> FileSourceStrategy: Pruning Partition Columns When No Partition Column Exist
> in Project
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-16689
> URL: https://issues.apache.org/jira/browse/SPARK-16689
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Xiao Li
>
> For partitioned file sources, the current implementation always includes all
> the partition columns in the scan output. This is unnecessary when the
> projected column list does not include any partition column. In addition, we
> can avoid the unnecessary Project.
> Below is an example:
> {noformat}
> spark
> .range(N)
> .selectExpr("id AS value1", "id AS value2", "id AS p1", "id AS p2", "id AS p3")
> .toDF("value", "value2", "p1", "p2", "p3").write.format("json")
> .partitionBy("p1", "p2", "p3").save(tempDir)
> spark.read.format("json").load(tempDir).selectExpr("value")
> {noformat}
> Before the PR changes, the physical plan looks like this:
> {noformat}
> == Physical Plan ==
> *Project [value#37L]
> +- *Scan json [value#37L,p1#39,p2#40,p3#41] Format: JSON, InputPaths:
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-f7a4294a-2e1b-4f44-9ebb-1a5eb...,
> PushedFilters: [], ReadSchema: struct<value:bigint>
> {noformat}
> After the PR changes, the physical plan becomes:
> {noformat}
> == Physical Plan ==
> *Scan json [value#147L] Format: JSON, InputPaths:
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-a5bcb14a-46c2-4c20-8f34-9662b...,
> PushedFilters: [], ReadSchema: struct<value:bigint>
> {noformat}
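
The difference between the two plans above can be illustrated with a small, Spark-free model. This is only a sketch under stated assumptions: it is not the code of FileSourceStrategy, the names ScanPlan, planBefore and planAfter are hypothetical, and it looks only at the projection (filters are ignored).

```scala
// Simplified model of the proposal; NOT Spark's actual FileSourceStrategy code.
// ScanPlan, planBefore and planAfter are hypothetical names used for illustration.
final case class ScanPlan(scanOutput: Seq[String], needsProject: Boolean)

object PartitionPruningSketch {
  // "Before" behaviour: every partition column is appended to the scan output,
  // so a Project is needed whenever the projection omits any of them.
  def planBefore(projected: Seq[String],
                 dataCols: Seq[String],
                 partitionCols: Seq[String]): ScanPlan = {
    val readData = dataCols.filter(c => projected.contains(c))
    val scanOutput = readData ++ partitionCols
    ScanPlan(scanOutput, needsProject = scanOutput != projected)
  }

  // "After" behaviour: only partition columns that the projection references
  // are kept, so the scan output can match the projection exactly.
  def planAfter(projected: Seq[String],
                dataCols: Seq[String],
                partitionCols: Seq[String]): ScanPlan = {
    val readData = dataCols.filter(c => projected.contains(c))
    val neededPartitions = partitionCols.filter(c => projected.contains(c))
    val scanOutput = readData ++ neededPartitions
    ScanPlan(scanOutput, needsProject = scanOutput != projected)
  }
}
```

With the columns from the example (data columns value and value2, partition columns p1, p2 and p3, projection value), planBefore yields a scan over value, p1, p2, p3 plus a Project, while planAfter yields a scan over value alone with no Project, mirroring the two physical plans quoted above.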