mbutrovich opened a new pull request, #4760:
URL: https://github.com/apache/datafusion-comet/pull/4760

   ## Which issue does this PR close?
   
   Closes #4723.
   
   ## Rationale for this change
   
   The native Iceberg scan returned more rows than Spark on MOR tables with 
positional deletes (the reporter in #4723 saw ~42% extra rows and a `WARN 
CometIcebergNativeScan: Failed to serialize delete file: null`). Root cause: 
the JVM serializer sized each delete file with a `FileIO.getLength()` HEAD on a 
reconstructed FileIO, and on any failure (e.g. a path the reconstructed FileIO 
could not resolve) it silently dropped the delete file. Dropped deletes mean 
deleted rows leak back into the scan, producing wrong counts and inflated 
aggregates.
   
   The size was never needed on the JVM side. iceberg-rust seeks each delete 
file's Parquet footer from `file_size_in_bytes`, and the native scan already 
holds a working FileIO (the one that reads the data files). So the native side 
can stat the delete file itself, with the same FileIO, instead of trusting a 
fragile JVM-side HEAD or the manifest value (which is itself untrustworthy, see 
apache/iceberg#12554).
   
   ## What changes are included in this PR?
   
   - Stop sizing delete files on the JVM side and remove `file_size_in_bytes` 
from the `IcebergDeleteFile` proto message. A delete file whose path cannot be 
extracted is now fatal rather than silently skipped.
   - Native scan fills each unique delete file's size as the first step of the 
`FileScanTaskStream`, statting via `FileIO::new_input(path).metadata()` on the 
iceberg runtime (deduped, concurrent). A stat failure is fatal and surfaces as 
a query error, never a silent row leak.
   
   ## How are these changes tested?
   
   - Native unit tests for `fill_delete_file_sizes`: errors on an unreadable 
delete file, populates the real size, no-op without deletes.
   - `CometIcebergNativeSuite`: existing MOR positional/equality delete tests 
now exercise the native fill; new tests cover deletes after 
`rewrite_data_files` compaction and assert that a missing delete file fails the 
scan loudly instead of dropping deletes.
   - `IcebergReadFromS3Suite`: MOR-with-deletes over MinIO exercises the fill 
against an S3 object store.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to