alamb commented on code in PR #6945:
URL: https://github.com/apache/arrow-rs/pull/6945#discussion_r1904664517
##########
parquet/src/file/serialized_reader.rs:
##########
@@ -568,6 +568,62 @@ impl<R: ChunkReader> SerializedPageReader<R> {
physical_type: meta.column_type(),
})
}
+
+ /// Similar to `peek_next_page`, but returns the offset of the next page
instead of the page metadata.
+ /// Unlike page metadata, an offset can uniquely identify a page.
+ ///
+ /// This is used when we need to read parquet with row-filter, and we
don't want to decompress the page twice.
+ /// This function allows us to check if the next page is being cached or
read previously.
+ pub fn peek_next_page_offset(&mut self) -> Result<Option<usize>> {
+ match &mut self.state {
+ SerializedPageReaderState::Values {
+ offset,
+ remaining_bytes,
+ next_page_header,
+ } => {
+ loop {
+ if *remaining_bytes == 0 {
Review Comment:
my only real concern is the fact that this body has so much duplication with
`peek_next_page` (especially in the SerializedPageReaderState::Values block)
it is also somewhat strange it is in a different `impl` block than
`peek_next_page` (I would have expected it to be next to it) but maybe I missed
some generic subtlety
I tried a few ways to avoid the duplication and I didn't really find any
good way to do so,
##########
parquet/src/file/serialized_reader.rs:
##########
@@ -568,6 +568,62 @@ impl<R: ChunkReader> SerializedPageReader<R> {
physical_type: meta.column_type(),
})
}
+
+ /// Similar to `peek_next_page`, but returns the offset of the next page
instead of the page metadata.
+ /// Unlike page metadata, an offset can uniquely identify a page.
+ ///
+ /// This is used when we need to read parquet with row-filter, and we
don't want to decompress the page twice.
+ /// This function allows us to check if the next page is being cached or
read previously.
+ pub fn peek_next_page_offset(&mut self) -> Result<Option<usize>> {
Review Comment:
I think it would be better to make this non pub as I don't think it needs to
be a public API (`peek_next_page` is not `pub` either)
I removed the `pub` and got a "not used" error, but I think that is because
it will be used in the future as part of
https://github.com/apache/arrow-rs/pull/6921
This seemed to work for me locally:
```suggestion
#[cfg(test)]
fn peek_next_page_offset(&mut self) -> Result<Option<usize>> {
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]