alamb commented on issue #5854: URL: https://github.com/apache/arrow-rs/issues/5854#issuecomment-3154746369
> Bumping this rather than creating a new issue. Also rolling in [#7909](https://github.com/apache/arrow-rs/issues/7909) and [#6129](https://github.com/apache/arrow-rs/issues/6129).
>
> Here's what I'm planning:
>
> 1. Add more thrift processing benchmarks.
> 2. Reduce use of `parquet::format` as much as possible, especially in publicly exposed data structures like `FileMetaData`.

This will be awesome -- I have likewise found the `format` module very hard to work with.

> 3. Create a custom thrift parser to decode directly to the structures created in step 2. Part of this task will address [[Parquet] reader appears to have bug when supporting unknown sort orders #7909](https://github.com/apache/arrow-rs/issues/7909) by correctly dealing with unknown union values for `LogicalType` and `ColumnOrder`. This step will also leverage the macros developed by [@jhorstmann](https://github.com/jhorstmann) (https://github.com/jhorstmann/compact-thrift).
> 4. Use the parser from step 3 internally to read non-exposed structures such as the page headers.

This will be awesome -- when I was working on the parquet push decoder I also found that the existing thrift decoder copies data that is already in memory *AGAIN* using the `std::io::Read` trait -- it would be awesome if the new decoder took data in an in-memory `Buffer`. (A few illustrative sketches of the unknown-union handling, zero-copy reading, and direct writing are at the end of this comment.)

Not sure what your plans are, but I would like to recommend a "push" decoder that avoids having to do IO at all, similar to what I have in https://github.com/apache/arrow-rs/pull/7997. A taste from `parquet/src/file/metadata/push_decoder.rs`:

```rust
/// The `ParquetMetaDataPushDecoder` needs to know the file length.
/// let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len).unwrap();
/// // Try to decode the metadata. If more data is needed, the decoder will tell you what ranges it needs.
/// loop {
///     match decoder.try_decode() {
///         Ok(DecodeResult::Data(metadata)) => { return Ok(metadata); } // decode successful
///         Ok(DecodeResult::NeedsData(ranges)) => {
///             // The decoder needs more data
///             //
///             // In this example, we call a function that returns the bytes for each given range.
///             // In a real application, you would likely read the data from a file or network.
///             let data = ranges.iter().map(|range| get_range(range)).collect();
///             // Push the data into the decoder and try to decode again on the next iteration.
///             decoder.push_ranges(ranges, data).unwrap();
///         }
///         Ok(DecodeResult::Finished) => { unreachable!("returned metadata in previous match arm") }
///         Err(e) => return Err(e),
///     }
/// }
```

> 5. Add the ability to write the new structures directly to thrift-encoded bytes.
> 6. Remove the `format` module.
> 7. Explore opportunities for further speedups. Examples include skipping row groups and projected columns, not decoding page statistics, and halting processing after reading the schema.
>
> Hopefully I can have all of the above ready in time for 57.0.0 😅

Sounds ambitious! However, I am around to review things -- please feel free to ping me and I'll try my best to get them reviewed.
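Since a couple of the steps above are easier to discuss with code, here are some purely illustrative sketches (hand-written for this comment, not the actual arrow-rs APIs or @jhorstmann's macros). First, the unknown-union handling from step 3: a hand-rolled decoder can keep an explicit `Unknown` variant instead of erroring when a newer writer uses a field id this reader has never seen, which is the behavior #7909 needs for `ColumnOrder`:

```rust
// Illustrative only: model unknown thrift union variants explicitly.
#[derive(Debug, Clone, PartialEq)]
enum ColumnOrder {
    /// Field id 1 in the parquet thrift IDL (`TypeDefinedOrder`)
    TypeDefinedOrder,
    /// Any field id this version of the reader does not know about
    Unknown(i16),
}

fn column_order_from_field_id(field_id: i16) -> ColumnOrder {
    match field_id {
        1 => ColumnOrder::TypeDefinedOrder,
        // An older reader seeing a newer sort order must not fail; it keeps
        // the raw id so callers can ignore min/max statistics for the column.
        other => ColumnOrder::Unknown(other),
    }
}
```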
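The zero-copy point is similar: a reader that works directly on a slice of bytes already in memory can parse varints in place and *borrow* binary fields out of the buffer instead of copying them through `std::io::Read`. A minimal sketch, with made-up names and simplified error handling:

```rust
// Illustrative only: a compact-protocol style reader over an in-memory slice.
struct SliceReader<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> SliceReader<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, pos: 0 }
    }

    /// Read a ULEB128 varint, the basic building block of the compact protocol.
    fn read_varint(&mut self) -> Result<u64, String> {
        let mut value = 0u64;
        let mut shift = 0u32;
        loop {
            let byte = *self.buf.get(self.pos).ok_or("unexpected EOF")?;
            self.pos += 1;
            value |= u64::from(byte & 0x7F) << shift;
            if byte & 0x80 == 0 {
                return Ok(value);
            }
            shift += 7;
            if shift >= 64 {
                return Err("varint too long".to_string());
            }
        }
    }

    /// Binary fields are borrowed from the underlying buffer, not copied.
    fn read_bytes(&mut self) -> Result<&'a [u8], String> {
        let len = self.read_varint()? as usize;
        let end = self.pos.checked_add(len).ok_or("length overflow")?;
        let bytes = self.buf.get(self.pos..end).ok_or("unexpected EOF")?;
        self.pos = end;
        Ok(bytes)
    }
}
```

Because `read_bytes` returns a slice tied to the input's lifetime, decoding metadata from bytes that are already resident in memory costs no extra copies.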
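And the write side (step 5) is largely the same primitives in reverse: the compact protocol bottoms out in varints plus zigzag encoding for signed integers, so structures can be serialized straight to bytes without building `parquet::format` values first. Again purely illustrative:

```rust
// Illustrative only: the two primitives needed to emit thrift compact
// protocol directly into a byte vector.
fn write_varint(out: &mut Vec<u8>, mut v: u64) {
    loop {
        let byte = (v & 0x7F) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// The compact protocol stores i32 as a zigzag-encoded varint so that
/// small negative numbers stay small on the wire.
fn write_zigzag_i32(out: &mut Vec<u8>, v: i32) {
    let zigzag = ((v as u32) << 1) ^ ((v >> 31) as u32);
    write_varint(out, u64::from(zigzag));
}
```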
