rahil-c commented on PR #17904: URL: https://github.com/apache/hudi/pull/17904#issuecomment-3761550758
Sharing some recent findings to give context to the current implementation, i.e. why we maintain the separate classes

```
LanceBasicSchemaEvolution
LanceFileFormatHelper
```

instead of reusing `SparkBasicSchemaEvolution` and `ParquetFileFormatHelper`.

Currently, for our Parquet schema evolution support we rely on Spark's native Parquet reader to handle columns that do not exist in the file, as well as the null padding. When running a basic schema evolution test with Parquet as the base file format, I can see it enter `VectorizedParquetRecordReader`, which has a field called `missingColumns`:

<img width="2454" height="670" alt="Screenshot 2026-01-16 at 10 59 03 AM" src="https://github.com/user-attachments/assets/be917961-5122-4b2c-95d0-4a6a4e7c3f92" />

The `missingColumns` field documents the null-padding behavior:

```java
/**
 * For each leaf column, if it is in the set, it means the column is missing in the file and
 * we'll instead return NULLs.
 */
private Set<ParquetColumn> missingColumns;
```

[VectorizedParquetRecordReader.java#L97](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L97)

Unfortunately, the Lance file reader does not behave this way and does not handle null padding: if you ask it for a column that does not exist in the file, it errors out immediately, unlike Parquet. As a quick test I passed the `requiredSchema` to `lanceReader.readAll` and encountered the following runtime error:
```
Caused by: java.io.IOException: Failed to read Lance file: /var/folders/lm/0j1q1s_n09b4wgqkdqbzpbkm0000gn/T/junit-14064069028248189535/dataset/test_lance_schema_evolution_copy_on_write/20bc4ff6-1eb2-4068-b87d-4105182d7306-0_0-16-19_20260116105048079.lance
	at org.apache.spark.sql.execution.datasources.lance.SparkLanceReaderBase.read(SparkLanceReaderBase.scala:143)
	at org.apache.hudi.SparkFileFormatInternalRowReaderContext.getFileRecordIterator(SparkFileFormatInternalRowReaderContext.scala:105)
	at org.apache.hudi.common.engine.HoodieReaderContext.getFileRecordIterator(HoodieReaderContext.java:273)
	at org.apache.hudi.common.table.read.HoodieFileGroupReader.makeBaseFileIterator(HoodieFileGroupReader.java:157)
	at org.apache.hudi.common.table.read.HoodieFileGroupReader.initRecordIterators(HoodieFileGroupReader.java:129)
	at org.apache.hudi.common.table.read.HoodieFileGroupReader.getBufferedRecordIterator(HoodieFileGroupReader.java:292)
	at org.apache.hudi.common.table.read.HoodieFileGroupReader.getClosableHoodieRecordIterator(HoodieFileGroupReader.java:308)
	at org.apache.hudi.io.FileGroupReaderBasedMergeHandle.doMerge(FileGroupReaderBasedMergeHandle.java:269)
	... 38 more
Caused by: java.lang.RuntimeException: LanceError(Schema): Column email does not exist, /Users/runner/work/lance/lance/rust/lance-core/src/datatypes/schema.rs:265:31
	at com.lancedb.lance.file.LanceFileReader.readAllNative(Native Method)
	at com.lancedb.lance.file.LanceFileReader.readAll(LanceFileReader.java:138)
	at org.apache.spark.sql.execution.datasources.lance.SparkLanceReaderBase.read(SparkLanceReaderBase.scala:110)
	... 45 more
```

I also do not believe that `GenerateUnsafeProjection.generate` handles the null padding either, since I ran into the following exception:

```
Caused by: java.lang.IllegalStateException: Memory was leaked by query.
Memory leaked: (64) Allocator(hudi-arrow-SparkLanceReaderBase-data-/var/folders/lm/0j1q1s_n09b4wgqkdqbzpbkm0000gn/T/junit-9822325871446255147/dataset/test_lance_schema_evolution_copy_on_write/19f34eb0-e71d-4b91-b08f-b2fd95e266dc-0_0-16-19_20260116111900287.lance) 0/64/128/125829120 (res/actual/peak/limit)
	at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:504)
	at org.apache.spark.sql.execution.datasources.lance.SparkLanceReaderBase.read(SparkLanceReaderBase.scala:141)
	at org.apache.hudi.SparkFileFormatInternalRowReaderContext.getFileRecordIterator(SparkFileFormatInternalRowReaderContext.scala:105)
	at org.apache.hudi.common.engine.HoodieReaderContext.getFileRecordIterator(HoodieReaderContext.java:273)
	at org.apache.hudi.common.table.read.HoodieFileGroupReader.makeBaseFileIterator(HoodieFileGroupReader.java:157)
	at org.apache.hudi.common.table.read.HoodieFileGroupReader.initRecordIterators(HoodieFileGroupReader.java:129)
	at org.apache.hudi.common.table.read.HoodieFileGroupReader.getBufferedRecordIterator(HoodieFileGroupReader.java:292)
	at org.apache.hudi.common.table.read.HoodieFileGroupReader.getClosableHoodieRecordIterator(HoodieFileGroupReader.java:308)
	at org.apache.hudi.io.FileGroupReaderBasedMergeHandle.doMerge(FileGroupReaderBasedMergeHandle.java:269)
	at org.apache.hudi.io.IOUtils.runMerge(IOUtils.java:120)
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:392)
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:358)
	... 35 more
```

After I explicitly provided a `null` value for fields that did not exist in the row, `testSchemaEvolutionAddColumn` passed. See `LanceFileFormatHelper#generateUnsafeProjection`.
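To make the gap concrete, here is a minimal, self-contained sketch of the contract involved. It is not the actual Spark/Hudi/Lance code (schemas are modeled as ordered column-name lists and rows as name-to-value maps, whereas the real code works on `StructType`/`InternalRow`); it only illustrates the `missingColumns`-style bookkeeping that Spark's Parquet reader does for free and that `LanceFileFormatHelper` has to do explicitly on top of the Lance reader:

```java
import java.util.*;

public class NullPaddingSketch {

    /** Columns requested by the query but absent from the file (analogous to Spark's missingColumns set). */
    static Set<String> missingColumns(List<String> fileSchema, List<String> requiredSchema) {
        Set<String> missing = new LinkedHashSet<>(requiredSchema);
        fileSchema.forEach(missing::remove);
        return missing;
    }

    /** Project a file row onto the required schema, padding missing columns with null
     *  instead of failing the way the Lance reader does when asked for an absent column. */
    static List<Object> projectWithNullPadding(Map<String, Object> fileRow, List<String> requiredSchema) {
        List<Object> out = new ArrayList<>(requiredSchema.size());
        for (String col : requiredSchema) {
            out.add(fileRow.getOrDefault(col, null));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> fileSchema = List.of("id", "name");
        // "email" was added by schema evolution after the file was written
        List<String> requiredSchema = List.of("id", "name", "email");

        System.out.println(missingColumns(fileSchema, requiredSchema));
        System.out.println(projectWithNullPadding(Map.of("id", 1, "name", "a"), requiredSchema));
    }
}
```

The key design point is the same in both worlds: the reader is only ever asked for the intersection of the required schema and the file schema, and the missing tail is padded with nulls in a projection afterwards.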
