rdblue opened a new pull request #1955: URL: https://github.com/apache/iceberg/pull/1955
This updates Spark's `DELETE FROM` command to sort the retained rows by original file and position, ensuring that the command preserves the original data clustering. Because Spark does not yet support metadata columns, this exposes `_file` and `_pos` by adding them automatically to all merge scans. Projecting both columns was already mostly supported; only minor changes were needed to project `_file` using the constants map supported by Avro, Parquet, and ORC.

This also required refactoring `DynamicFileFilter`. When both `_file` and `_pos` were projected but only `_file` was used, the optimizer would throw an exception that the node could not be copied, because it was attempting to rewrite the node with a projection that removed the unused `_pos`. The fix is to update `DynamicFileFilter` so that the `SupportsFileFilter` is passed separately; the scan can then be passed as a logical plan that the planner is free to rewrite. This in turn required updating the conversion to a physical plan, because the scan plan may now be more complicated than a single scan node. A new logical plan wrapper ensures that the scan is converted to an extended scan, so that `planLater` can be used in conversion as normal.
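The clustering-preserving idea can be sketched in a few lines. This is a minimal illustration only, not Iceberg code; the `delete_rows` helper and the sample rows are hypothetical:

```python
# Illustrative sketch (not Iceberg code): a copy-on-write DELETE can preserve
# the original data clustering by tagging each row with its source file
# (_file) and row position (_pos), then sorting the retained rows by both
# before rewriting the files.

rows = [
    {"_file": "b.parquet", "_pos": 0, "id": 4},
    {"_file": "a.parquet", "_pos": 1, "id": 2},
    {"_file": "a.parquet", "_pos": 0, "id": 1},
    {"_file": "b.parquet", "_pos": 1, "id": 5},
]

def delete_rows(rows, predicate):
    """Drop rows matching the predicate, then restore the original layout."""
    retained = [r for r in rows if not predicate(r)]
    retained.sort(key=lambda r: (r["_file"], r["_pos"]))
    return retained

# retained rows come back grouped by file and in within-file order
result = delete_rows(rows, lambda r: r["id"] == 2)
```

Without the sort, Spark's shuffle during the rewrite would interleave rows from different source files and destroy whatever clustering the table had.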
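The constants-map projection mentioned above can also be sketched. This is a hypothetical toy reader, not the Iceberg reader API; the point is only that `_file` has a single value for every row read from one file, so it can be injected per file rather than materialized per row:

```python
# Illustrative sketch (hypothetical reader, not the Iceberg API): a column
# whose value is constant for a whole file, like _file, can be filled in
# from a per-file constants map instead of being read from the data, which
# is why projecting _file needed only minor changes in each format's reader.

def read_with_constants(records, schema, constants):
    """Project `schema`, taking a column from `constants` when present."""
    return [
        {name: constants.get(name, rec.get(name)) for name in schema}
        for rec in records
    ]

file_rows = read_with_constants(
    [{"id": 1}, {"id": 2}],
    schema=["_file", "id"],
    constants={"_file": "a.parquet"},  # fixed for every row in this file
)
```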
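The `DynamicFileFilter` restructuring can be pictured the same way. The classes below are hypothetical stand-ins, not Spark or Iceberg internals; they only illustrate why holding the `SupportsFileFilter` separately from the scan plan lets an optimizer rule rewrite the scan subtree without copying the filter-holding node:

```python
# Illustrative sketch (hypothetical classes, not Spark/Iceberg internals):
# the node keeps its file-filter callback outside the rewritable scan plan,
# so a column-pruning rule can rebuild the scan (e.g. drop an unused _pos
# column) while reusing the callback unchanged.
from dataclasses import dataclass
from typing import Callable, List, Set

@dataclass
class Scan:
    columns: List[str]

@dataclass
class DynamicFileFilter:
    file_filter: Callable[[Set[str]], None]  # stand-in for SupportsFileFilter
    scan_plan: Scan                          # rewritable logical plan

def prune_columns(node: DynamicFileFilter, used: Set[str]) -> DynamicFileFilter:
    """Optimizer rule: rewrite only the scan subtree; the callback is reused."""
    pruned = Scan([c for c in node.scan_plan.columns if c in used])
    return DynamicFileFilter(node.file_filter, pruned)
```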
