rahil-c commented on PR #17904:
URL: https://github.com/apache/hudi/pull/17904#issuecomment-3761550758

   Sharing some recent findings to give context on why the current implementation maintains the separate classes
   ```
   LanceBasicSchemaEvolution
   LanceFileFormatHelper
   ```
   instead of reusing `SparkBasicSchemaEvolution` and `ParquetFileFormatHelper`.
   
   Currently, for Parquet schema evolution support, we rely on Spark's native Parquet reader to handle columns that do not exist in the file, as well as the null padding. When running a basic schema evolution test with Parquet as the base file format, I see it enter `VectorizedParquetRecordReader`, which has a variable called `missingColumns`.
   
   <img width="2454" height="670" alt="Screenshot 2026-01-16 at 10 59 03 AM" 
src="https://github.com/user-attachments/assets/be917961-5122-4b2c-95d0-4a6a4e7c3f92";
 />
   
   The `missingColumns` variable documents the following behavior around null padding:
   ```
   /**
    * For each leaf column, if it is in the set, it means the column is missing in the file and
    * we'll instead return NULLs.
    */
   private Set<ParquetColumn> missingColumns;
   ```
   
   https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L97
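
   For reference, here is a minimal Scala sketch of that Parquet behavior (the local `SparkSession`, path, and column names are my own, not from the PR): reading a Parquet file with a required schema that contains a column not present in the file yields nulls for that column instead of an error.
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.types._

   val spark = SparkSession.builder().master("local[1]").appName("parquet-null-padding").getOrCreate()
   import spark.implicits._

   val path = "/tmp/parquet_null_padding_demo"
   // Write a Parquet file that only has `id` and `name`.
   Seq((1, "a"), (2, "b")).toDF("id", "name").write.mode("overwrite").parquet(path)

   // Read it back with an "evolved" schema that also asks for `email`.
   val evolvedSchema = StructType(Seq(
     StructField("id", IntegerType),
     StructField("name", StringType),
     StructField("email", StringType)))

   // The vectorized Parquet reader treats `email` as a missing column and null-pads it.
   spark.read.schema(evolvedSchema).parquet(path).show()
   ```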
 
   
   
   Unfortunately, the Lance file reader does not behave this way and does not handle null padding. If you give the Lance file reader a column that does not exist in the file, it immediately errors out, unlike Parquet. As a quick test I tried passing the `requiredSchema` to `lanceReader.readAll` and hit the following runtime error.
   
   ```
   Caused by: java.io.IOException: Failed to read Lance file: /var/folders/lm/0j1q1s_n09b4wgqkdqbzpbkm0000gn/T/junit-14064069028248189535/dataset/test_lance_schema_evolution_copy_on_write/20bc4ff6-1eb2-4068-b87d-4105182d7306-0_0-16-19_20260116105048079.lance
        at org.apache.spark.sql.execution.datasources.lance.SparkLanceReaderBase.read(SparkLanceReaderBase.scala:143)
        at org.apache.hudi.SparkFileFormatInternalRowReaderContext.getFileRecordIterator(SparkFileFormatInternalRowReaderContext.scala:105)
        at org.apache.hudi.common.engine.HoodieReaderContext.getFileRecordIterator(HoodieReaderContext.java:273)
        at org.apache.hudi.common.table.read.HoodieFileGroupReader.makeBaseFileIterator(HoodieFileGroupReader.java:157)
        at org.apache.hudi.common.table.read.HoodieFileGroupReader.initRecordIterators(HoodieFileGroupReader.java:129)
        at org.apache.hudi.common.table.read.HoodieFileGroupReader.getBufferedRecordIterator(HoodieFileGroupReader.java:292)
        at org.apache.hudi.common.table.read.HoodieFileGroupReader.getClosableHoodieRecordIterator(HoodieFileGroupReader.java:308)
        at org.apache.hudi.io.FileGroupReaderBasedMergeHandle.doMerge(FileGroupReaderBasedMergeHandle.java:269)
        ... 38 more
   Caused by: java.lang.RuntimeException: LanceError(Schema): Column email does not exist, /Users/runner/work/lance/lance/rust/lance-core/src/datatypes/schema.rs:265:31
        at com.lancedb.lance.file.LanceFileReader.readAllNative(Native Method)
        at com.lancedb.lance.file.LanceFileReader.readAll(LanceFileReader.java:138)
        at org.apache.spark.sql.execution.datasources.lance.SparkLanceReaderBase.read(SparkLanceReaderBase.scala:110)
        ... 45 more
   ```
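
   This is what motivates doing the column pruning in a Lance-specific helper rather than reusing `ParquetFileFormatHelper`. A rough Scala sketch of the idea (hypothetical helper name, not the actual `LanceFileFormatHelper` code): intersect the required schema with the file schema, hand only the columns that actually exist to the Lance reader, and leave the missing ones to be null-padded later in the projection.
   ```scala
   import org.apache.spark.sql.types.StructType

   // Hypothetical sketch: split the required schema into columns that exist in the
   // Lance file (safe to ask the reader for) and columns that are missing from it
   // (to be null-padded afterwards in the projection).
   def splitRequiredSchema(requiredSchema: StructType, fileSchema: StructType): (StructType, Seq[String]) = {
     val fileFieldNames = fileSchema.fieldNames.toSet
     val (present, missing) = requiredSchema.fields.partition(f => fileFieldNames.contains(f.name))
     (StructType(present), missing.map(_.name).toSeq)
   }
   ```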
        
   I also do not believe that `GenerateUnsafeProjection.generate` handles the null padding either, since I ran into the following exception:
   ```
   Caused by: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (64)
   Allocator(hudi-arrow-SparkLanceReaderBase-data-/var/folders/lm/0j1q1s_n09b4wgqkdqbzpbkm0000gn/T/junit-9822325871446255147/dataset/test_lance_schema_evolution_copy_on_write/19f34eb0-e71d-4b91-b08f-b2fd95e266dc-0_0-16-19_20260116111900287.lance) 0/64/128/125829120 (res/actual/peak/limit)
        at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:504)
        at org.apache.spark.sql.execution.datasources.lance.SparkLanceReaderBase.read(SparkLanceReaderBase.scala:141)
        at org.apache.hudi.SparkFileFormatInternalRowReaderContext.getFileRecordIterator(SparkFileFormatInternalRowReaderContext.scala:105)
        at org.apache.hudi.common.engine.HoodieReaderContext.getFileRecordIterator(HoodieReaderContext.java:273)
        at org.apache.hudi.common.table.read.HoodieFileGroupReader.makeBaseFileIterator(HoodieFileGroupReader.java:157)
        at org.apache.hudi.common.table.read.HoodieFileGroupReader.initRecordIterators(HoodieFileGroupReader.java:129)
        at org.apache.hudi.common.table.read.HoodieFileGroupReader.getBufferedRecordIterator(HoodieFileGroupReader.java:292)
        at org.apache.hudi.common.table.read.HoodieFileGroupReader.getClosableHoodieRecordIterator(HoodieFileGroupReader.java:308)
        at org.apache.hudi.io.FileGroupReaderBasedMergeHandle.doMerge(FileGroupReaderBasedMergeHandle.java:269)
        at org.apache.hudi.io.IOUtils.runMerge(IOUtils.java:120)
        at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:392)
        at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:358)
        ... 35 more
   ```
        
   After I explicitly provided a `null` value for fields that did not exist in the row, the `testSchemaEvolutionAddColumn` test would pass. See `LanceFileFormatHelper#generateUnsafeProjection`.
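
   For context, this is roughly the shape that null-padding projection takes (a sketch of the approach, assuming `requiredSchema`/`fileSchema` as inputs, not the exact `LanceFileFormatHelper#generateUnsafeProjection` code): fields that exist in the file are bound to their ordinal in the row read from the Lance file, and missing fields become a typed `null` literal before the projection is generated.
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{BoundReference, Expression, Literal}
   import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
   import org.apache.spark.sql.types.StructType

   // Sketch: build one expression per field of the required schema. Fields present in
   // the file schema read from their ordinal in the file row; fields missing from the
   // file are padded with a typed null literal.
   def nullPaddingProjection(requiredSchema: StructType, fileSchema: StructType) = {
     val fileOrdinals = fileSchema.fieldNames.zipWithIndex.toMap
     val exprs: Seq[Expression] = requiredSchema.fields.toSeq.map { field =>
       fileOrdinals.get(field.name) match {
         case Some(ordinal) => BoundReference(ordinal, field.dataType, field.nullable)
         case None          => Literal(null, field.dataType)
       }
     }
     GenerateUnsafeProjection.generate(exprs)
   }
   ```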

