jackye1995 opened a new pull request #2651:
URL: https://github.com/apache/iceberg/pull/2651


   @rdblue @openinx @chenjunjiedada 
   
   This is somewhat related to #2372, but instead of adding a delete marker 
for each row, this PR directly adds a `CloseableIterable<Boolean> 
checkShouldKeep(CloseableIterable<T> records)` method that returns an iterable 
of booleans indicating whether each record should be kept.
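
   To illustrate the shape of such a method, here is a minimal sketch. It is 
not the code in this PR: it uses a plain `Iterable` instead of 
`CloseableIterable`, and the `isDeleted` predicate and the `KeepFlagsSketch` 
class are hypothetical stand-ins for whatever delete matching `DeleteFilter` 
does internally.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-in for illustration only; not Iceberg's DeleteFilter.
class KeepFlagsSketch {

  // Lazily maps each record to true (keep) or false (deleted),
  // preserving the order of the input records.
  static <T> Iterable<Boolean> checkShouldKeep(Iterable<T> records, Predicate<T> isDeleted) {
    return () -> {
      Iterator<T> inner = records.iterator();
      return new Iterator<Boolean>() {
        @Override
        public boolean hasNext() {
          return inner.hasNext();
        }

        @Override
        public Boolean next() {
          return !isDeleted.test(inner.next());
        }
      };
    };
  }

  public static void main(String[] args) {
    List<Long> ids = List.of(10L, 11L, 12L, 13L);
    Set<Long> deletedIds = Set.of(11L, 13L); // assumed delete set

    List<Boolean> flags = new ArrayList<>();
    for (boolean keep : checkShouldKeep(ids, deletedIds::contains)) {
      flags.add(keep);
    }
    System.out.println(flags); // prints [true, false, true, false]
  }
}
```

   The output has exactly one flag per input record, in the same order, so the 
caller can line the flags up with row positions.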
   
   This is primarily to accommodate the usage of `DeleteFilter` in Trino, where 
records are supplied in batch through `Page`s, unlike the Flink and Spark cases 
where there is an explicit row level representation like `InternalRow` and 
`RowData`.
   
   For the Trino case, a `Page` is a more column-oriented data structure that 
consists of one `Block` per column, and each `Block` is an array of raw data. 
So a row in a page is a `page.getPosition(index)`, which is basically a bunch 
of raw data scattered across all the blocks. (This is probably an 
oversimplified explanation, but it should be enough context here.)
   
   So in the delete case, instead of using the `CloseableIterable<T> 
filter(CloseableIterable<T> records)` method to get all the remaining rows, I 
find it much more efficient to know, for each row, whether it is deleted. With 
that information, I can simply reuse the same `Page` with only the positions 
that are not deleted, instead of filtering each row and rebuilding a page from 
the filtered rows. There is a method `page.getPositions(positionArray, ...)` 
which serves exactly this purpose.
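
   As a rough sketch of that idea, the keep flags can be turned into an array 
of retained positions, which is then applied to every column. This does not 
use Trino classes: the `long[]` columns stand in for `Block`s, and 
`retainedPositions` and `selectPositions` are hypothetical helpers, with 
`selectPositions` playing the role of the `page.getPositions(positionArray, 
...)` call mentioned above.

```java
import java.util.Arrays;
import java.util.List;

// Toy model for illustration only; not Trino's Page or Block.
class RetainedPositionsSketch {

  // Collects the row indexes whose keep flag is true.
  static int[] retainedPositions(Iterable<Boolean> keepFlags) {
    int[] positions = new int[16];
    int kept = 0;
    int row = 0;
    for (boolean keep : keepFlags) {
      if (keep) {
        if (kept == positions.length) {
          positions = Arrays.copyOf(positions, positions.length * 2);
        }
        positions[kept++] = row;
      }
      row++;
    }
    return Arrays.copyOf(positions, kept);
  }

  // Picks the given positions out of a single column.
  static long[] selectPositions(long[] column, int[] positions) {
    long[] selected = new long[positions.length];
    for (int i = 0; i < positions.length; i++) {
      selected[i] = column[positions[i]];
    }
    return selected;
  }

  public static void main(String[] args) {
    // Two columns of a four-row "page"; rows 1 and 3 are deleted.
    long[] ids = {10L, 11L, 12L, 13L};
    long[] amounts = {100L, 110L, 120L, 130L};
    List<Boolean> keepFlags = List.of(true, false, true, false);

    int[] positions = retainedPositions(keepFlags);
    System.out.println(Arrays.toString(positions)); // prints [0, 2]
    System.out.println(Arrays.toString(selectPositions(ids, positions))); // prints [10, 12]
    System.out.println(Arrays.toString(selectPositions(amounts, positions))); // prints [100, 120]
  }
}
```

   The position array is computed once per page and reused for every column, 
so no row object is materialized along the way.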
   
   The delete marker work seems more oriented toward generating the "deleted 
or not" information after filtering and presenting it to the end data frame 
requester, whereas this change would be used by a lower-level query engine 
data reader.
   
   

