openinx commented on pull request #2372:
URL: https://github.com/apache/iceberg/pull/2372#issuecomment-810092443


   > Instead, I think that the way to do this is to select all rows and set a 
metadata column to indicate whether or not the row is deleted.
   
   I've tried to think through how to add the `_is_deleted` metadata column for 
each record read from the Parquet/ORC readers. The workflow would be: 
   
   1.  When constructing the Parquet/ORC readers for the given Iceberg schema, 
add a boolean reader at the tail. This boolean reader just fills in a default 
value of `false` for each record; the real value is filled in later, after 
checking the equality delete files and position delete files iteratively. 
   2.  The struct Parquet/ORC reader reads the whole row; at this point the 
`_is_deleted` value is `false` by default. 
   3.  Check the equality delete files and position delete files, and set 
`_is_deleted` to `true` if the row has been deleted. This requires that Flink's 
`RowData` and Spark's `InternalRow` provide a `setValue(pos, value)` interface 
to update the real value of `_is_deleted`.
   4.  Return `Iterable<Row>`.
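   The steps above can be sketched roughly as follows. This is a minimal
illustration, not Iceberg's actual reader code: the `Row` class, `setValue`,
`readWithIsDeleted`, and `markDeleted` are all hypothetical stand-ins for
Flink's `RowData` / Spark's `InternalRow` and the reader/delete-check plumbing.

   ```java
   import java.util.*;

   // Hypothetical mutable row, standing in for Flink RowData / Spark InternalRow.
   class Row {
       private final Object[] values;
       Row(Object... values) { this.values = values; }
       Object get(int pos) { return values[pos]; }
       // Step 3 requires a writable slot for the metadata column.
       void setValue(int pos, Object value) { values[pos] = value; }
       int arity() { return values.length; }
   }

   public class IsDeletedSketch {
       // Steps 1 and 2: the "reader" appends an _is_deleted column, default false.
       static Row readWithIsDeleted(Object[] dataValues) {
           Object[] withFlag = Arrays.copyOf(dataValues, dataValues.length + 1);
           withFlag[dataValues.length] = Boolean.FALSE; // _is_deleted defaults to false
           return new Row(withFlag);
       }

       // Step 3: instead of dropping deleted rows, mark them in place.
       // deletedKeys stands in for the equality/position delete checks.
       static Iterable<Row> markDeleted(List<Row> rows, Set<Object> deletedKeys, int keyPos) {
           for (Row row : rows) {
               int isDeletedPos = row.arity() - 1; // metadata column is at the tail
               if (deletedKeys.contains(row.get(keyPos))) {
                   row.setValue(isDeletedPos, Boolean.TRUE);
               }
           }
           return rows; // Step 4: return all rows, with deleted ones flagged
       }

       public static void main(String[] args) {
           List<Row> rows = new ArrayList<>();
           rows.add(readWithIsDeleted(new Object[] {1, "a"}));
           rows.add(readWithIsDeleted(new Object[] {2, "b"}));
           markDeleted(rows, Set.of(2), 0);
           System.out.println(rows.get(0).get(2) + " " + rows.get(1).get(2));
           // prints: false true
       }
   }
   ```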
   
   The most complicated work is in the third step, because we would need to 
refactor all of the `Deletes#filter` paths to return a boolean flag per row, 
rather than just returning the filtered `Iterable<T>`. That amounts to 
refactoring almost all of the delete-filter logic, so I'm a little hesitant 
about whether it is worth doing.
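   The shape of that refactor can be sketched as a signature change: today's
filter drops deleted rows, while the new path would keep every row and report
a per-row flag. The method names here (`filter`, `deleteFlags`) are
illustrative simplifications, not the actual `Deletes` API.

   ```java
   import java.util.*;
   import java.util.function.Predicate;

   public class DeleteFilterSketch {
       // Current shape (simplified): deleted rows are dropped from the iterable.
       static <T> Iterable<T> filter(Iterable<T> rows, Predicate<T> isDeleted) {
           List<T> kept = new ArrayList<>();
           for (T row : rows) {
               if (!isDeleted.test(row)) {
                   kept.add(row);
               }
           }
           return kept;
       }

       // Proposed shape (simplified): keep every row and expose a per-row flag
       // that the caller writes into the _is_deleted metadata column.
       static <T> List<Boolean> deleteFlags(Iterable<T> rows, Predicate<T> isDeleted) {
           List<Boolean> flags = new ArrayList<>();
           for (T row : rows) {
               flags.add(isDeleted.test(row));
           }
           return flags;
       }

       public static void main(String[] args) {
           List<Integer> rows = List.of(1, 2, 3);
           Predicate<Integer> deleted = r -> r == 2;
           System.out.println(filter(rows, deleted));      // prints: [1, 3]
           System.out.println(deleteFlags(rows, deleted)); // prints: [false, true, false]
       }
   }
   ```

   The cost is that every call site of the filter path has to change from 
consuming a smaller iterable to consuming all rows plus a flag, which is why 
the refactor touches so much of the delete-filter logic.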
   
   

