(spark) branch branch-3.5 updated: [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy

yao Thu, 25 Jul 2024 01:52:38 -0700

This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new c4ef321d5599 [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy
c4ef321d5599 is described below

commit c4ef321d5599349cd6a2a6d69f7cd532887d7bb6
Author: Johan Lasperas <[email protected]>
AuthorDate: Thu Jul 25 16:52:16 2024 +0800

    [SPARK-48308][CORE][3.5] Unify getting data schema without partition columns in FileSourceStrategy
    
    ### What changes were proposed in this pull request?
    
    (Cherry-pick of 57948c865e064469a75c92f8b58c632b9b40fdd3 to branch-3.5)
    
    Compute the schema of the data without partition columns only once in FileSourceStrategy.
    
    ### Why are the changes needed?
    
    In FileSourceStrategy, the schema of the data excluding partition columns is computed twice in slightly different ways: once using an AttributeSet (`partitionSet`) and once using the attributes directly (`partitionColumns`). These don't have exactly the same semantics: AttributeSet only uses expression ids for comparison, while comparing with the actual attributes uses the name, type, nullability and metadata. We want to use the former here.
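
    The difference between the two comparisons can be illustrated with a minimal, self-contained sketch. It does not use Spark's classes: `Attr`, `AttrSet` and `PartitionFilterSketch` are hypothetical stand-ins, assumed here only to mimic equality on all fields versus membership by expression id.

```scala
// Simplified stand-in for a Catalyst attribute; NOT Spark's actual class.
// Case-class equality compares every field (name, type, nullability, exprId).
final case class Attr(name: String, dataType: String, nullable: Boolean, exprId: Long)

// Simplified stand-in for AttributeSet: membership is decided by exprId only.
final class AttrSet(attrs: Seq[Attr]) {
  private val ids: Set[Long] = attrs.map(_.exprId).toSet
  def contains(a: Attr): Boolean = ids.contains(a.exprId)
}

object PartitionFilterSketch {
  // One partition column, identified by exprId 1.
  val partitionColumns: Seq[Attr] =
    Seq(Attr("date", "string", nullable = true, exprId = 1L))
  val partitionSet = new AttrSet(partitionColumns)

  // The same column (same exprId) shows up here with a different nullability.
  val dataColumns: Seq[Attr] = Seq(
    Attr("id", "long", nullable = false, exprId = 0L),
    Attr("date", "string", nullable = false, exprId = 1L)
  )

  // Comparing the attributes directly: the nullability mismatch hides the match.
  val byEquality: Seq[Attr] = dataColumns.filterNot(partitionColumns.contains)

  // Comparing by expression id only: the partition column is removed.
  val byExprId: Seq[Attr] = dataColumns.filterNot(partitionSet.contains)

  def main(args: Array[String]): Unit = {
    println(byEquality.map(_.name)) // List(id, date)
    println(byExprId.map(_.name))   // List(id)
  }
}
```

    In this sketch the direct comparison keeps `date` among the data columns because one field differs, while the id-based set removes it.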
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes #47483 from vkorukanti/partitionCols.
    
    Authored-by: Johan Lasperas <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
---
 .../apache/spark/sql/execution/datasources/FileSourceStrategy.scala    | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
index e4bf24ad88d1..9fe42c6bcf2b 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
@@ -210,9 +210,8 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
       val requiredExpressions: Seq[NamedExpression] = filterAttributes.toSeq ++ projects
       val requiredAttributes = AttributeSet(requiredExpressions)
 
-      val readDataColumns = dataColumns
+      val readDataColumns = dataColumnsWithoutPartitionCols
         .filter(requiredAttributes.contains)
-        .filterNot(partitionColumns.contains)
 
       // Metadata attributes are part of a column of type struct up to this point. Here we extract
       // this column from the schema and specify a matcher for that.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
