Xiao Li created SPARK-16689:
-------------------------------

             Summary: FileSourceStrategy: Pruning Partition Columns When No Partition Column Exist in Project
                 Key: SPARK-16689
                 URL: https://issues.apache.org/jira/browse/SPARK-16689
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Xiao Li


For partitioned file sources, the current implementation always scans all 
partition columns. This is unnecessary when the projected column list does 
not include any partition column. In that case, we can also avoid the 
unnecessary Project operator on top of the scan.

Below is an example:
{noformat}
spark
  .range(N)
  .selectExpr("id AS value1", "id AS value2", "id AS p1", "id AS p2", "id AS p3")
  .toDF("value", "value2", "p1", "p2", "p3").write.format("json")
  .partitionBy("p1", "p2", "p3").save(tempDir)

spark.read.format("json").load(tempDir).selectExpr("value")
{noformat}

Before the PR changes, the physical plan is:
{noformat}
== Physical Plan ==
*Project [value#37L]
+- *Scan json [value#37L,p1#39,p2#40,p3#41] Format: JSON, InputPaths: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-f7a4294a-2e1b-4f44-9ebb-1a5eb..., PushedFilters: [], ReadSchema: struct<value:bigint>
{noformat}

After the PR changes, the physical plan becomes:
{noformat}
== Physical Plan ==
*Scan json [value#147L] Format: JSON, InputPaths: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-a5bcb14a-46c2-4c20-8f34-9662b..., PushedFilters: [], ReadSchema: struct<value:bigint>
{noformat}
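
The difference between the two plans can be illustrated with a small, Spark-free model of the planning decision. This is only a sketch under stated assumptions, not the actual FileSourceStrategy code: the names PartitionPruningSketch, Attr, Scan, Project, and plan are hypothetical, and the rule "keep all partition columns whenever at least one is projected" is assumed from the issue title.

```scala
// Toy model of the pruning decision described above.
// All names here are hypothetical; this is NOT Spark's FileSourceStrategy.
object PartitionPruningSketch {
  final case class Attr(name: String)

  sealed trait Plan
  final case class Scan(output: Seq[Attr]) extends Plan
  final case class Project(projectList: Seq[Attr], child: Plan) extends Plan

  /** Plans a read of `requested` columns from a partitioned file source. */
  def plan(dataCols: Seq[Attr], partitionCols: Seq[Attr], requested: Seq[Attr]): Plan = {
    val requestedSet = requested.toSet
    // Data columns are pruned to the ones actually requested.
    val neededData = dataCols.filter(requestedSet.contains)
    // Assumption: partition columns are dropped only when none is projected.
    val needsPartitions = partitionCols.exists(requestedSet.contains)
    val scanOutput = if (needsPartitions) neededData ++ partitionCols else neededData
    val scan = Scan(scanOutput)
    // The Project is redundant when the scan already emits exactly the request.
    if (scanOutput == requested) scan else Project(requested, scan)
  }
}
```

With data columns value and value2 and partition columns p1, p2, p3, requesting only value yields a bare Scan in this model, which mirrors the second plan above. Requesting value and p1 keeps all three partition columns in the scan and retains the Project.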



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
