[spark] branch branch-2.4 updated: [SPARK-34212][SQL][FOLLOWUP] Parquet vectorized reader can read decimal fields with a larger precision

gurwls223 Tue, 02 Feb 2021 16:28:47 -0800

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 5f4e9ea  [SPARK-34212][SQL][FOLLOWUP] Parquet vectorized reader can read decimal fields with a larger precision
5f4e9ea is described below

commit 5f4e9ea7a1a70b7ba3c5ff1a4977f019ab43a3a1
Author: Wenchen Fan <[email protected]>
AuthorDate: Wed Feb 3 09:26:36 2021 +0900

    [SPARK-34212][SQL][FOLLOWUP] Parquet vectorized reader can read decimal fields with a larger precision
    
    ### What changes were proposed in this pull request?
    
    This is a followup of https://github.com/apache/spark/pull/31357
    
    #31357 added a very strict restriction to the vectorized Parquet reader:
    when reading decimal fields, the Spark data type had to exactly match the
    physical Parquet type. This restriction is not necessary, as we can safely
    read Parquet decimals into a type with a larger precision. This PR relaxes
    the restriction slightly.
    
    ### Why are the changes needed?
    
    To avoid failing queries unnecessarily.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Users can now read Parquet decimals with a mismatched `DecimalType`, as
    long as the scale is the same and the requested precision is larger.
    
    ### How was this patch tested?
    
    Updated an existing test in `SQLQuerySuite`.
    
    Closes #31443 from cloud-fan/improve.
    
    Authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: HyukjinKwon <[email protected]>
    (cherry picked from commit 00120ea53748d84976e549969f43cf2a50778c1c)
    Signed-off-by: HyukjinKwon <[email protected]>
---
 .../sql/execution/datasources/parquet/VectorizedColumnReader.java | 4 +++-
 sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala  | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)
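
As a rough illustration of the one-line change to `isDecimalTypeMatched` in the diff below, the following sketch compares the old and new predicates in isolation. The `Dec` class and both method names are hypothetical stand-ins, not Spark or Parquet APIs; only the comparison logic mirrors the patch.

```java
final class DecimalMatchSketch {
  // Hypothetical stand-in for a decimal type's (precision, scale) pair.
  static final class Dec {
    final int precision;
    final int scale;

    Dec(int precision, int scale) {
      this.precision = precision;
      this.scale = scale;
    }
  }

  // Predicate before the patch: precision and scale must both match exactly.
  // A null `physical` models a column without decimal metadata.
  static boolean strictMatch(Dec physical, Dec required) {
    return physical != null
        && physical.precision == required.precision
        && physical.scale == required.scale;
  }

  // Predicate after the patch: the required precision may be larger than or
  // equal to the physical precision, but the scale must still be identical.
  static boolean relaxedMatch(Dec physical, Dec required) {
    return physical != null
        && physical.precision <= required.precision
        && physical.scale == required.scale;
  }

  public static void main(String[] args) {
    Dec written = new Dec(17, 2); // e.g. a column written as DECIMAL(17, 2)

    System.out.println(strictMatch(written, new Dec(18, 2)));  // false
    System.out.println(relaxedMatch(written, new Dec(18, 2))); // true
    System.out.println(relaxedMatch(written, new Dec(18, 3))); // false: scale differs
    System.out.println(relaxedMatch(written, new Dec(9, 2)));  // false: precision too small
  }
}
```

The widening case (`DECIMAL(17, 2)` read as `DECIMAL(18, 2)`) is the one exercised by the updated test; the other cases are still rejected by the predicate.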

diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
index 4739089..ed8755c 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
@@ -106,7 +106,9 @@ public class VectorizedColumnReader {
   private boolean isDecimalTypeMatched(DataType dt) {
     DecimalType d = (DecimalType) dt;
     DecimalMetadata dm = descriptor.getPrimitiveType().getDecimalMetadata();
-    return dm != null && dm.getPrecision() == d.precision() && dm.getScale() == d.scale();
+    // It's OK if the required decimal precision is larger than or equal to the physical decimal
+    // precision in the Parquet metadata, as long as the decimal scale is the same.
+    return dm != null && dm.getPrecision() <= d.precision() && dm.getScale() == d.scale();
   }
 
   private boolean canReadAsIntDecimal(DataType dt) {
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index a2efed6..f262eab 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -3152,6 +3152,14 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
       val df = sql("SELECT 1.0 a, CAST(1.23 AS DECIMAL(17, 2)) b, CAST(1.23 AS DECIMAL(36, 2)) c")
       df.write.parquet(path.toString)
 
+      Seq(true, false).foreach { vectorizedReader =>
+        withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorizedReader.toString) {
+          // We can read the decimal parquet field with a larger precision, if scale is the same.
+          val schema = "a DECIMAL(9, 1), b DECIMAL(18, 2), c DECIMAL(38, 2)"
+          checkAnswer(readParquet(schema, path), df)
+        }
+      }
+
       withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") {
         val schema1 = "a DECIMAL(3, 2), b DECIMAL(18, 3), c DECIMAL(37, 3)"
         checkAnswer(readParquet(schema1, path), df)
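
One way to see why the relaxed check still insists on an identical scale: the Parquet DECIMAL logical type stores an unscaled integer, with precision and scale kept in the schema metadata. The sketch below assumes a reader that copies that unscaled value without rescaling it (an assumption made here for illustration; the class and method names are hypothetical). Under that assumption a larger precision only adds headroom, while a different scale would change the value.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class UnscaledDecimalSketch {
  // Interprets a stored unscaled integer under a given scale, the way a
  // decimal value is reconstructed from (unscaled value, scale).
  static BigDecimal interpret(long unscaled, int scale) {
    return new BigDecimal(BigInteger.valueOf(unscaled), scale);
  }

  public static void main(String[] args) {
    long unscaled = 123L; // 1.23 stored with scale 2

    // Same scale: the value is unchanged, and its 3 digits fit in any
    // precision of 3 or more (17, 18, 38, ...).
    System.out.println(interpret(unscaled, 2)); // 1.23

    // Different scale without rescaling: the same bits mean another number.
    System.out.println(interpret(unscaled, 3)); // 0.123
  }
}
```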

