rdblue opened a new pull request #1955: URL: https://github.com/apache/iceberg/pull/1955
This updates Spark's `DELETE FROM` command to sort the retained rows by original file and position, ensuring that the command preserves the original data clustering. Because Spark does not yet support metadata columns, this exposes `_file` and `_pos` by adding them automatically to all merge scans. Projecting both columns was already mostly supported; only minor changes were needed to project `_file` using the constants map supported by Avro, Parquet, and ORC.

This also required refactoring `DynamicFileFilter`. When both `_file` and `_pos` were projected but only `_file` was used, the optimizer would throw an exception that the node could not be copied, because it was attempting to rewrite the node with a projection that removed the unused `_pos`. The fix is to update `DynamicFileFilter` so that the `SupportsFileFilter` is passed separately; the scan can then be passed as a logical plan that the planner is free to rewrite. This in turn required updating the conversion to a physical plan, because the scan plan may now be more complicated than a single scan node. A new logical plan wrapper ensures that the scan is converted to an extended scan, so that `planLater` can be used in conversion as normal.
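The clustering-preserving idea can be sketched in a few lines. This is a minimal illustration only, not Iceberg code; the `delete_rows` helper and the sample rows are hypothetical:

```python
# Illustrative sketch (not Iceberg code): a copy-on-write DELETE can preserve
# the original data clustering by tagging each row with its source file
# (_file) and row position (_pos), then sorting the retained rows by both
# before rewriting the files.

rows = [
    {"_file": "b.parquet", "_pos": 0, "id": 4},
    {"_file": "a.parquet", "_pos": 1, "id": 2},
    {"_file": "a.parquet", "_pos": 0, "id": 1},
    {"_file": "b.parquet", "_pos": 1, "id": 5},
]

def delete_rows(rows, predicate):
    """Drop rows matching the predicate, then restore the original layout."""
    retained = [r for r in rows if not predicate(r)]
    retained.sort(key=lambda r: (r["_file"], r["_pos"]))
    return retained

# retained rows come back grouped by file and in within-file order
result = delete_rows(rows, lambda r: r["id"] == 2)
```

Without the sort, Spark's shuffle during the rewrite would interleave rows from different source files and destroy whatever clustering the table had.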
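The constants-map projection mentioned above can also be sketched. This is a hypothetical toy reader, not the Iceberg reader API; the point is only that `_file` has a single value for every row read from one file, so it can be injected per file rather than materialized per row:

```python
# Illustrative sketch (hypothetical reader, not the Iceberg API): a column
# whose value is constant for a whole file, like _file, can be filled in
# from a per-file constants map instead of being read from the data, which
# is why projecting _file needed only minor changes in each format's reader.

def read_with_constants(records, schema, constants):
    """Project `schema`, taking a column from `constants` when present."""
    return [
        {name: constants.get(name, rec.get(name)) for name in schema}
        for rec in records
    ]

file_rows = read_with_constants(
    [{"id": 1}, {"id": 2}],
    schema=["_file", "id"],
    constants={"_file": "a.parquet"},  # fixed for every row in this file
)
```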
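The `DynamicFileFilter` restructuring can be pictured the same way. The classes below are hypothetical stand-ins, not Spark or Iceberg internals; they only illustrate why holding the `SupportsFileFilter` separately from the scan plan lets an optimizer rule rewrite the scan subtree without copying the filter-holding node:

```python
# Illustrative sketch (hypothetical classes, not Spark/Iceberg internals):
# the node keeps its file-filter callback outside the rewritable scan plan,
# so a column-pruning rule can rebuild the scan (e.g. drop an unused _pos
# column) while reusing the callback unchanged.
from dataclasses import dataclass
from typing import Callable, List, Set

@dataclass
class Scan:
    columns: List[str]

@dataclass
class DynamicFileFilter:
    file_filter: Callable[[Set[str]], None]  # stand-in for SupportsFileFilter
    scan_plan: Scan                          # rewritable logical plan

def prune_columns(node: DynamicFileFilter, used: Set[str]) -> DynamicFileFilter:
    """Optimizer rule: rewrite only the scan subtree; the callback is reused."""
    pruned = Scan([c for c in node.scan_plan.columns if c in used])
    return DynamicFileFilter(node.file_filter, pruned)
```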
