Xiao Li created SPARK-16689:
-------------------------------
Summary: FileSourceStrategy: Pruning Partition Columns When No
Partition Column Exist in Project
Key: SPARK-16689
URL: https://issues.apache.org/jira/browse/SPARK-16689
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li
For partitioned file sources, the current implementation always scans all the
partition columns. However, this is not necessary when the projected column
list does not include any partition column. In addition, we also can avoid the
unnecessary Project.
Below is an example,
{noformat}
spark
.range(N)
.selectExpr("id AS value1", "id AS value2", "id AS p1", "id AS p2", "id AS
p3")
.toDF("value", "value2", "p1", "p2", "p3").write.format("json")
.partitionBy("p1", "p2", "p3").save(tempDir)
spark.read.format("json").load(tempDir).selectExpr("value")
{noformat}
Before the PR changes, the physical plan is like:
{noformat}
== Physical Plan ==
*Project [value#37L]
+- *Scan json [value#37L,p1#39,p2#40,p3#41] Format: JSON, InputPaths:
file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-f7a4294a-2e1b-4f44-9ebb-1a5eb...,
PushedFilters: [], ReadSchema: struct<value:bigint>
{noformat}
After the PR changes, the physical plan becomes:
{noformat}
== Physical Plan ==
*Scan json [value#147L] Format: JSON, InputPaths:
file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-a5bcb14a-46c2-4c20-8f34-9662b...,
PushedFilters: [], ReadSchema: struct<value:bigint>
{noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]