openinx commented on pull request #2372: URL: https://github.com/apache/iceberg/pull/2372#issuecomment-810092443
> Instead, I think that the way to do this is to select all rows and set a metadata column to indicate whether or not the row is deleted.

I've tried to think through how to add the `_is_deleted` metadata column for each record read from the Parquet/ORC readers. The workflow would be:

1. When constructing the Parquet/ORC readers for the given Iceberg schema, append a boolean reader at the tail. That reader just fills in a default value of `false` for each record; the real value is filled in later, after checking the equality delete files and position delete files.
2. The struct Parquet/ORC reader reads the whole row; at this point `_is_deleted` is still `false` by default.
3. Check the equality delete files and position delete files, and set `_is_deleted` to `true` if the row has been deleted. This requires that Flink's `RowData` and Spark's `InternalRow` provide a `setValue(pos, value)` interface to update the real value of `_is_deleted`.
4. Return the `Iterable<Row>`.

The most complicated work is in the third step, because we would need to refactor the whole `Deletes#filter` path to return a boolean flag per row rather than just returning the filtered `Iterable<T>`. That amounts to refactoring almost all of the delete-filter logic, so I'm a little hesitant about whether it is necessary.
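To make the idea concrete, here is a minimal, self-contained sketch of steps 1-4. The `Row` and `DeleteFlagger` classes below are hypothetical stand-ins (not Iceberg's actual `RowData`/`InternalRow` or `Deletes#filter` implementations), and the equality-delete check is simplified to a single key column:

```java
import java.util.*;

// Hypothetical minimal row type: the last column is the _is_deleted metadata
// flag, appended by the reader and initialized to false (steps 1-2).
class Row {
    final Object[] values;
    Row(Object... values) { this.values = values; }
    Object get(int pos) { return values[pos]; }
    // Step 3 needs a mutable setter so the delete filter can flip the flag;
    // this is the setValue(pos, value) interface mentioned above.
    void setValue(int pos, Object value) { values[pos] = value; }
}

class DeleteFlagger {
    // Instead of filtering deleted rows out, mark them in place and return
    // every row (step 4). positionDeletes holds deleted row ordinals;
    // equalityDeletes holds deleted key values for column 0 (a simplification).
    static List<Row> markDeletes(List<Row> rows,
                                 Set<Long> positionDeletes,
                                 Set<Object> equalityDeletes,
                                 int isDeletedPos) {
        long pos = 0;
        for (Row row : rows) {
            boolean deleted = positionDeletes.contains(pos)
                    || equalityDeletes.contains(row.get(0));
            if (deleted) {
                row.setValue(isDeletedPos, true);  // step 3: _is_deleted = true
            }
            pos++;
        }
        return rows;  // all rows are returned, none dropped
    }
}
```

The key contrast with the current `Deletes#filter` path is the return type: nothing is dropped from the iterable, so downstream operators (e.g. a changelog scan) can distinguish live rows from deleted ones by reading the flag column.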
