baibaichen opened a new issue, #11494:
URL: https://github.com/apache/incubator-gluten/issues/11494
### Backend
VL (Velox)
### Bug description
**Expected behavior**: When a column is added via `ALTER TABLE ADD COLUMN`
with a DEFAULT value, queries over existing rows should return that default
for the newly added column (via the EXISTS_DEFAULT lazy backfill mechanism).
**Actual behavior**: Existing rows return `NULL` for the newly added column
instead of the configured default value.
This affects the EXISTS_DEFAULT lazy backfill mechanism introduced in Spark
3.4.0 (SPARK-38334). While data inserted after the ALTER correctly uses
default values, data inserted before the ALTER returns NULL.
### Reproduction Steps
```scala
// Test case from GlutenInsertSuite
testGluten("ALTER ADD COLUMN DEFAULT - old INT data backfill") {
withTable("t") {
sql("create table t(a int) using parquet")
// Insert OLD data BEFORE ALTER
sql("insert into t values(1)")
sql("insert into t values(2)")
// ALTER TABLE ADD COLUMN with DEFAULT
sql("alter table t add column (b int default 99)")
// Query old data
spark.table("t").orderBy("a").show()
}
}
```
**Expected Result**:
```
+---+---+
| a| b|
+---+---+
| 1| 99|
| 2| 99|
+---+---+
```
**Actual Result**:
```
+---+----+
| a| b|
+---+----+
| 1|null|
| 2|null|
+---+----+
```
### Test Failure Output
```
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
!struct<>                   struct<a:int,b:int>
![1,99]                     [1,null]
![2,99]                     [2,null]
```
All 5 test cases in GlutenInsertSuite fail with the same pattern:
1. `ALTER ADD COLUMN DEFAULT - old INT data backfill` - returns NULL instead
of 99
2. `ALTER ADD COLUMN DEFAULT - old STRING data backfill` - returns NULL
instead of 'unknown'
3. `ALTER ADD COLUMN DEFAULT - old BOOLEAN data backfill` - returns NULL
instead of true
4. `ALTER ADD COLUMN DEFAULT - old DATE data backfill` - returns NULL
instead of date '2023-01-01'
5. `ALTER ADD COLUMN DEFAULT - multiple types old data backfill` - returns
NULL for all default columns
### Gluten version
main branch
### Spark version
Spark-3.4.x, Spark-3.5.x, Spark-4.0.x, Spark-4.1.x
### System information
- Backend: Velox
- File format: Parquet
- Test suite:
`gluten-ut/spark41/src/test/scala/org/apache/spark/sql/sources/GlutenInsertSuite.scala`
### Root Cause Analysis
Spark's DEFAULT column implementation uses two metadata keys (see the sketch
after this list):
1. **CURRENT_DEFAULT**: the original SQL expression, evaluated for
INSERT/UPDATE/MERGE
2. **EXISTS_DEFAULT**: the constant-folded result, used for backfilling
missing columns in existing data
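For illustration, a minimal sketch of inspecting this metadata on the repro
table after the ALTER. The key names are Spark's metadata-key constants;
whether `spark.table` surfaces them on the field metadata depends on the
catalog, so treat this as a sketch rather than a verified check:
```scala
// Sketch (assumption): the catalog-backed schema retains the DEFAULT
// metadata for column "b" added by `alter table t add column (b int default 99)`.
val field = spark.table("t").schema("b")
// Original SQL expression text, evaluated at INSERT/UPDATE/MERGE time:
println(field.metadata.getString("CURRENT_DEFAULT")) // expected: 99
// Constant-folded literal, used to backfill rows written before the ALTER:
println(field.metadata.getString("EXISTS_DEFAULT"))  // expected: 99
```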
When reading old Parquet files that lack the new column, data sources should
call `getExistenceDefaultValues()` to fill the missing columns with their
default values. This appears not to be implemented in the Velox backend.
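As a rough illustration of that contract (not Gluten code; the helper name
comes from the issue description, and the hand-built schema below mirrors the
repro table):
```scala
import org.apache.spark.sql.catalyst.util.ResolveDefaultColumns
import org.apache.spark.sql.types._

// Read schema for the repro table after the ALTER; "b" carries the
// EXISTS_DEFAULT metadata that the old Parquet files know nothing about.
val bMeta = new MetadataBuilder()
  .putString("CURRENT_DEFAULT", "99")
  .putString("EXISTS_DEFAULT", "99")
  .build()
val readSchema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", IntegerType, nullable = true, metadata = bMeta)))

// One entry per field: the constant to emit when that field is absent from
// the file (null for fields without EXISTS_DEFAULT). A reader that skips
// this step and emits plain nulls produces exactly the bug reported here.
val defaults: Array[Any] = ResolveDefaultColumns.getExistenceDefaultValues(readSchema)
```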
### Related Spark Issues
- SPARK-38334: Implement support for DEFAULT values for columns in tables
(Spark 3.4.0)
- SPARK-38811: Support for default column values in ALTER TABLE ADD COLUMN
- SPARK-39265: File source v2: support default values for Parquet
### Additional Context
The newly added tests in GlutenInsertSuite (lines 601-721) specifically test
the EXISTS_DEFAULT backfill scenario that Spark's own test suite doesn't cover.
These tests expose this Velox limitation.
Note: NEW data inserted AFTER the ALTER statement correctly uses default
values; only OLD data (inserted before the ALTER) is affected, as the snippet
below illustrates.
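For contrast, extending the reproduction above (the `(3, default)` form uses
the explicit DEFAULT keyword from SPARK-38334; the commented output reflects
the behavior described in this issue):
```scala
// A row written AFTER the ALTER picks up the default, while the
// pre-ALTER rows still come back as null on the Velox backend.
sql("insert into t values (3, default)") // writes b = 99 into a new file
spark.table("t").orderBy("a").show()
// Observed (per this issue): (1, null), (2, null), (3, 99)
```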
---
*This issue was generated with AI assistance (Claude Code).*