alamb commented on code in PR #2464: URL: https://github.com/apache/arrow-rs/pull/2464#discussion_r947253518
########## parquet/src/column/writer/mod.rs: ########## @@ -1654,10 +1655,10 @@ mod tests { assert!(stats.distinct_count().is_none()); let reader = SerializedPageReader::new( - std::io::Cursor::new(buf), - 7, - Compression::UNCOMPRESSED, - Type::INT32, + Arc::new(Bytes::from(buf)), Review Comment: I thought `Bytes` were already ref counted -- is there any need to wrap this in an additional `Arc`? ########## parquet/src/file/serialized_reader.rs: ########## @@ -471,234 +480,232 @@ pub(crate) fn decode_page( Ok(result) } -enum SerializedPages<T: Read> { - /// Read entire chunk - Chunk { buf: T }, - /// Read operate pages which can skip. +enum SerializedPageReaderState { + Values { + /// The current byte offset in the reader + offset: usize, + + /// The length of the chunk in bytes + remaining_bytes: usize, + }, Pages { - offset_index: Vec<PageLocation>, - seen_num_data_pages: usize, - has_dictionary_page_to_read: bool, - page_bufs: VecDeque<T>, + /// Remaining page locations + page_locations: VecDeque<PageLocation>, + /// Remaining dictionary location if any + dictionary_page: Option<PageLocation>, + /// The total number of rows in this column chunk + total_rows: usize, }, } /// A serialized implementation for Parquet [`PageReader`]. -pub struct SerializedPageReader<T: Read> { - // The file source buffer which references exactly the bytes for the column trunk - // to be read by this page reader. - buf: SerializedPages<T>, +pub struct SerializedPageReader<R: ChunkReader> { Review Comment: This is a non trivial change, right? To make this different than `std::io::Read`? But on the other hand the `SerializedFileReader` isn't changed -- https://docs.rs/parquet/20.0.0/parquet/file/index.html#example-of-reading-an-existing-file Maybe to ease the transition we can add some docstring example showing how to create a `ChunkReader` from a `std::io::Read`? Or maybe it doesn't matter. I think we should be sensitive and over communicate a change like this in the SerializedPageReader. @zeevm do you have any comments on this change (especially in regards to your comments on https://github.com/apache/arrow-rs/issues/2394#issuecomment-1213032232)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org