[GitHub] [arrow-rs] alamb commented on a diff in pull request #2464: Push ChunkReader into SerializedPageReader (#2463)

GitBox Tue, 16 Aug 2022 14:20:37 -0700


alamb commented on code in PR #2464:
URL: https://github.com/apache/arrow-rs/pull/2464#discussion_r947253518



##########
parquet/src/column/writer/mod.rs:
##########
@@ -1654,10 +1655,10 @@ mod tests {
         assert!(stats.distinct_count().is_none());
 
         let reader = SerializedPageReader::new(
-            std::io::Cursor::new(buf),
-            7,
-            Compression::UNCOMPRESSED,
-            Type::INT32,
+            Arc::new(Bytes::from(buf)),

Review Comment:
   I thought `Bytes` were already ref counted -- is there any need to wrap this 
in an additional `Arc`? 



##########
parquet/src/file/serialized_reader.rs:
##########
@@ -471,234 +480,232 @@ pub(crate) fn decode_page(
     Ok(result)
 }
 
-enum SerializedPages<T: Read> {
-    /// Read entire chunk
-    Chunk { buf: T },
-    /// Read operate pages which can skip.
+enum SerializedPageReaderState {
+    Values {
+        /// The current byte offset in the reader
+        offset: usize,
+
+        /// The length of the chunk in bytes
+        remaining_bytes: usize,
+    },
     Pages {
-        offset_index: Vec<PageLocation>,
-        seen_num_data_pages: usize,
-        has_dictionary_page_to_read: bool,
-        page_bufs: VecDeque<T>,
+        /// Remaining page locations
+        page_locations: VecDeque<PageLocation>,
+        /// Remaining dictionary location if any
+        dictionary_page: Option<PageLocation>,
+        /// The total number of rows in this column chunk
+        total_rows: usize,
     },
 }
 
 /// A serialized implementation for Parquet [`PageReader`].
-pub struct SerializedPageReader<T: Read> {
-    // The file source buffer which references exactly the bytes for the 
column trunk
-    // to be read by this page reader.
-    buf: SerializedPages<T>,
+pub struct SerializedPageReader<R: ChunkReader> {

Review Comment:
   This is a non trivial change, right? To make this different than 
`std::io::Read`? But on the other hand the `SerializedFileReader` isn't changed 
-- 
https://docs.rs/parquet/20.0.0/parquet/file/index.html#example-of-reading-an-existing-file
   
   Maybe to ease the transition we can add some docstring example showing how 
to create a `ChunkReader` from a `std::io::Read`?  Or maybe it doesn't matter. 
   
   I think we should be sensitive and over communicate a change like this in 
the SerializedPageReader.
   
   @zeevm  do you have any comments on this change (especially in regards to 
your comments on 
https://github.com/apache/arrow-rs/issues/2394#issuecomment-1213032232)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-rs] alamb commented on a diff in pull request #2464: Push ChunkReader into SerializedPageReader (#2463)

Reply via email to