jackye1995 opened a new pull request #2651: URL: https://github.com/apache/iceberg/pull/2651
@rdblue @openinx @chenjunjiedada

This is somewhat related to #2372, but instead of adding a delete marker to each row, this PR adds a `CloseableIterable<Boolean> checkShouldKeep(CloseableIterable<T> records)` method that returns an iterable of booleans indicating whether each record should be kept.

This is primarily to accommodate the use of `DeleteFilter` in Trino, where records are supplied in batches through `Page`s, unlike the Flink and Spark cases, where there is an explicit row-level representation like `InternalRow` and `RowData`. In Trino, a `Page` is a column-oriented data structure consisting of one `Block` per column, and each `Block` is an array of raw data. A row in a page is therefore `page.getPosition(index)`, which is basically a set of raw values scattered across all the blocks. (This is probably an oversimplified explanation, but it should be enough context here.)

So for deletes, instead of calling `CloseableIterable<T> filter(CloseableIterable<T> records)` to get all the remaining rows, I find it much more efficient to know, for each row, whether it is deleted. With that information, I can reuse the same `Page` with only the positions that are not deleted, instead of filtering each row and rebuilding a page from the filtered rows. The method `page.getPositions(positionArray, ...)` serves exactly this purpose.

The delete marker work seems more oriented toward generating the "deleted or not" information after filtering and presenting it to the end data frame requester, whereas this change is intended for a lower-level query engine data reader.
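To make the keep-mask idea in the description concrete, here is a minimal, self-contained sketch. It is not the code in this PR and uses neither Iceberg nor Trino classes: `KeepMaskSketch`, `retainedPositions`, and the `isDeleted` predicate are hypothetical names introduced only for illustration, and plain `Iterable`/`List` stand in for `CloseableIterable`. The sketch is also eager, whereas an iterable-based method would presumably be evaluated lazily.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for illustration; not Iceberg's DeleteFilter or Trino's Page.
final class KeepMaskSketch {

  // Analogue of checkShouldKeep: one boolean per input record, in input order.
  // true means the record survives the deletes, false means it is deleted.
  static <T> List<Boolean> checkShouldKeep(Iterable<T> records, Predicate<T> isDeleted) {
    List<Boolean> keep = new ArrayList<>();
    for (T record : records) {
      keep.add(!isDeleted.test(record));
    }
    return keep;
  }

  // Collects the indexes of the kept rows. An engine with a columnar batch could
  // pass such an array to a positions-based selection instead of rebuilding rows.
  static int[] retainedPositions(List<Boolean> keep) {
    int[] positions = new int[keep.size()];
    int count = 0;
    for (int i = 0; i < keep.size(); i++) {
      if (keep.get(i)) {
        positions[count++] = i;
      }
    }
    return Arrays.copyOf(positions, count);
  }

  public static void main(String[] args) {
    // Five rows identified by made-up ids; two of them are marked as deleted.
    List<Long> rowIds = Arrays.asList(10L, 11L, 12L, 13L, 14L);
    List<Long> deleted = Arrays.asList(11L, 13L);

    List<Boolean> keep = checkShouldKeep(rowIds, deleted::contains);
    System.out.println(Arrays.toString(retainedPositions(keep))); // prints [0, 2, 4]
  }
}
```

The point of the design is that the caller never materializes filtered rows: the boolean mask is reduced to a position array, and the original batch is reused with only those positions selected.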
