alamb commented on code in PR #6945:
URL: https://github.com/apache/arrow-rs/pull/6945#discussion_r1904664517


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -568,6 +568,62 @@ impl<R: ChunkReader> SerializedPageReader<R> {
             physical_type: meta.column_type(),
         })
     }
+
+    /// Similar to `peek_next_page`, but returns the offset of the next page 
instead of the page metadata.
+    /// Unlike page metadata, an offset can uniquely identify a page.
+    ///
+    /// This is used when we need to read parquet with row-filter, and we 
don't want to decompress the page twice.
+    /// This function allows us to check if the next page is being cached or 
read previously.
+    pub fn peek_next_page_offset(&mut self) -> Result<Option<usize>> {
+        match &mut self.state {
+            SerializedPageReaderState::Values {
+                offset,
+                remaining_bytes,
+                next_page_header,
+            } => {
+                loop {
+                    if *remaining_bytes == 0 {

Review Comment:
   my only real concern is the fact that this body has so much duplication with 
`peek_next_page` (especially in the SerializedPageReaderState::Values block)
   
   it is also somewhat strange it is in a different `impl` block than 
`peek_next_page` (I would have expected it to be next to it) but maybe I missed 
some generic subtlety
   
   I tried a few ways to avoid the duplication and I didn't really find any 
good way to do so, 



##########
parquet/src/file/serialized_reader.rs:
##########
@@ -568,6 +568,62 @@ impl<R: ChunkReader> SerializedPageReader<R> {
             physical_type: meta.column_type(),
         })
     }
+
+    /// Similar to `peek_next_page`, but returns the offset of the next page 
instead of the page metadata.
+    /// Unlike page metadata, an offset can uniquely identify a page.
+    ///
+    /// This is used when we need to read parquet with row-filter, and we 
don't want to decompress the page twice.
+    /// This function allows us to check if the next page is being cached or 
read previously.
+    pub fn peek_next_page_offset(&mut self) -> Result<Option<usize>> {

Review Comment:
   I think it would be better to make this non pub as I don't think it needs to 
be a public API (`peek_next_page` is not `pub` either)
   
   I removed the `pub` and got a "not used" error, but I think that is because 
it will be used in the future as part of 
https://github.com/apache/arrow-rs/pull/6921
   
   This seemed to work for me locally:
   ```suggestion
       #[cfg(test)]
       fn peek_next_page_offset(&mut self) -> Result<Option<usize>> {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to