[GitHub] [spark] dtenedor opened a new pull request, #37501: [SPARK-39926][SQL] Fix bug in column DEFAULT support for non-vectorized Parquet scans

GitBox Fri, 12 Aug 2022 14:18:17 -0700


dtenedor opened a new pull request, #37501:
URL: https://github.com/apache/spark/pull/37501


   ### What changes were proposed in this pull request?
   
   Fix a bug in column DEFAULT support for non-vectorized Parquet scans, where 
inserting explicit NULL values into a column with a DEFAULT and then selecting 
the column back would sometimes erroneously return the default value. 
   
   To exercise the behavior:
   
   ```
   set spark.sql.parquet.enableVectorizedReader=false;
   create table t(a int) using parquet;
   insert into t values (42);
   alter table t add column b int default 42;
   insert into t values (43, null);
   select * from t;
   ```
   
   This should return two rows:
   
   `(42, 42)` and `(43, NULL)`
   
   Instead, the scan missed the inserted NULL value and returned the 
existence DEFAULT value of 42:
   
   `(42, 42)` and `(43, 42)`
   
   With this fix, Spark returns the correct result: `(42, 42)` and `(43, NULL)`.
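
The intended semantics can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: the names `SCHEMA`, `EXISTENCE_DEFAULTS`, `scan`, and `buggy_scan` are hypothetical, and `buggy_scan` reproduces the symptom described above rather than the actual cause inside Spark. The assumption is that the existence default applies only to rows written before the column was added, while an explicitly stored NULL is returned as NULL.

```python
# Toy model of the expected behavior; not Spark's implementation.
SCHEMA = ["a", "b"]
# Hypothetical record of the default captured when column b was added.
EXISTENCE_DEFAULTS = {"b": 42}


def scan(files):
    """Expected behavior: use the default only when the file lacks the column."""
    rows = []
    for file_columns, file_rows in files:
        for row in file_rows:
            out = []
            for col in SCHEMA:
                if col in file_columns:
                    # The file stores this column: keep the value, even if None.
                    out.append(row[col])
                else:
                    # The file predates the column: fall back to the default.
                    out.append(EXISTENCE_DEFAULTS.get(col))
            rows.append(tuple(out))
    return rows


def buggy_scan(files):
    """Symptom only: every missing-or-NULL value is replaced by the default."""
    rows = []
    for _file_columns, file_rows in files:
        for row in file_rows:
            out = []
            for col in SCHEMA:
                value = row.get(col)
                if value is None:
                    value = EXISTENCE_DEFAULTS.get(col)
                out.append(value)
            rows.append(tuple(out))
    return rows


# File written before "alter table t add column b int default 42".
old_file = (["a"], [{"a": 42}])
# File written by "insert into t values (43, null)".
new_file = (["a", "b"], [{"a": 43, "b": None}])

print(scan([old_file, new_file]))
print(buggy_scan([old_file, new_file]))
```

In this model, `scan` keeps the explicit NULL (Python `None`) for the second row, while `buggy_scan` replaces it with the default, matching the incorrect output shown above.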
   
   ### Why are the changes needed?
   
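   This fixes a correctness bug: without it, queries over Parquet tables read 
with the non-vectorized reader can return a column's DEFAULT value where an 
explicit NULL was inserted.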
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   The PR includes unit test coverage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

