baibaichen opened a new issue, #11494:
URL: https://github.com/apache/incubator-gluten/issues/11494

   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   **Expected behavior**: When a column is added with `ALTER TABLE ADD COLUMN` 
and a DEFAULT value, existing rows should return that default for the new 
column when queried (via the EXISTS_DEFAULT lazy backfill mechanism).
   
   **Actual behavior**: Existing rows return `NULL` for the newly added column 
instead of the configured default value.
   
   This affects the EXISTS_DEFAULT lazy backfill mechanism introduced in Spark 
3.4.0 (SPARK-38334). While newly inserted data (after ALTER) correctly uses 
default values, old data (inserted before ALTER) returns NULL.
   
   ### Reproduction Steps
   
   ```scala
   // Test case from GlutenInsertSuite
   testGluten("ALTER ADD COLUMN DEFAULT - old INT data backfill") {
     withTable("t") {
       sql("create table t(a int) using parquet")
       
       // Insert OLD data BEFORE ALTER
       sql("insert into t values(1)")
       sql("insert into t values(2)")
       
       // ALTER TABLE ADD COLUMN with DEFAULT
       sql("alter table t add column (b int default 99)")
       
       // Query old data
       spark.table("t").orderBy("a").show()
     }
   }
   ```
   
   **Expected Result**:
   ```
   +---+---+
   |  a|  b|
   +---+---+
   |  1| 99|
   |  2| 99|
   +---+---+
   ```
   
   **Actual Result**:
   ```
   +---+----+
   |  a|   b|
   +---+----+
   |  1|null|
   |  2|null|
   +---+----+
   ```
   
   ### Test Failure Output
   
   ```
   == Results ==
   !== Correct Answer - 2 ==   == Spark Answer - 2 ==
   !struct<>                   struct<a:int,b:int>
   ![1,99]                     [1,null]
   ![2,99]                     [2,null]
   ```
   
   All 5 test cases in GlutenInsertSuite fail with the same pattern:
   1. `ALTER ADD COLUMN DEFAULT - old INT data backfill` - returns NULL instead 
of 99
   2. `ALTER ADD COLUMN DEFAULT - old STRING data backfill` - returns NULL 
instead of 'unknown'
   3. `ALTER ADD COLUMN DEFAULT - old BOOLEAN data backfill` - returns NULL 
instead of true
   4. `ALTER ADD COLUMN DEFAULT - old DATE data backfill` - returns NULL 
instead of date '2023-01-01'
   5. `ALTER ADD COLUMN DEFAULT - multiple types old data backfill` - returns 
NULL for all default columns
   
   ### Gluten version
   
   main branch
   
   ### Spark version
   
   Spark-3.4.x, Spark-3.5.x, Spark-4.0.x, Spark-4.1.x
   
   ### System information
   
   - Backend: Velox
   - File format: Parquet
   - Test suite: 
`gluten-ut/spark41/src/test/scala/org/apache/spark/sql/sources/GlutenInsertSuite.scala`
   
   ### Root Cause Analysis
   
   Spark's DEFAULT column implementation uses two metadata keys:
   
   1. **CURRENT_DEFAULT**: Original SQL expression (used for 
INSERT/UPDATE/MERGE)
   2. **EXISTS_DEFAULT**: Constant-folded result (used for backfilling missing 
columns in existing data)
   
   When reading old Parquet files that don't have the new column, data sources 
should call `getExistenceDefaultValues()` to fill missing columns with default 
values. This does not appear to be implemented in the Velox backend.
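   
   The backfill rule described above can be sketched as a small, dependency-free 
   model. This is only an illustration of the intended semantics, not Spark's or 
   Velox's reader code: `Column`, `backfill`, and `ExistsDefaultBackfill` are 
   hypothetical names, and the EXISTS_DEFAULT value is assumed to be already 
   constant-folded.
   
   ```scala
   // Simplified model of EXISTS_DEFAULT backfill (hypothetical names throughout).
   object ExistsDefaultBackfill {
     // A table column, optionally carrying a constant-folded EXISTS_DEFAULT value.
     final case class Column(name: String, existsDefault: Option[Any] = None)

     // Project one row read from a file onto the current table schema.
     // Columns present in the file keep their stored value; columns absent
     // from the file take EXISTS_DEFAULT if defined, otherwise null.
     def backfill(
         tableSchema: Seq[Column],
         fileColumns: Seq[String],
         fileRow: Seq[Any]): Seq[Any] = {
       val stored: Map[String, Any] = fileColumns.zip(fileRow).toMap
       tableSchema.map { col =>
         stored.getOrElse(col.name, col.existsDefault.orNull)
       }
     }

     def main(args: Array[String]): Unit = {
       // Table schema after: alter table t add column (b int default 99)
       val tableSchema = Seq(Column("a"), Column("b", existsDefault = Some(99)))

       // Old file written before ALTER: only column "a" exists.
       println(backfill(tableSchema, Seq("a"), Seq(1)))         // List(1, 99)

       // New file written after ALTER: "b" is stored, so it is read as-is.
       println(backfill(tableSchema, Seq("a", "b"), Seq(3, 7))) // List(3, 7)
     }
   }
   ```
   
   In this model the default applies only when the column is absent from the 
   file. A `null` that was explicitly stored in a newer file stays `null`, which 
   is why the lookup is keyed on column presence rather than on the value.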
   
   ### Related Spark Issues
   
   - SPARK-38334: Implement support for DEFAULT values for columns in tables 
(Spark 3.4.0)
   - SPARK-38811: Support for default column values in ALTER TABLE ADD COLUMN
   - SPARK-39265: File source v2: support default values for Parquet
   
   ### Additional Context
   
   The newly added tests in GlutenInsertSuite (lines 601-721) specifically test 
the EXISTS_DEFAULT backfill scenario that Spark's own test suite doesn't cover. 
These tests expose this Velox limitation.
   
   Note: NEW data inserted AFTER the ALTER statement correctly uses default 
values. Only OLD data (inserted before ALTER) is affected.
   
   ---
   
   *This issue was generated with AI assistance (Claude Code).*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

