etseidl commented on issue #5854: URL: https://github.com/apache/arrow-rs/issues/5854#issuecomment-3151859881
Bumping this rather than creating a new issue. Also rolling in #7909 and #6129. Here's what I'm planning:

1. Add more thrift processing benchmarks.
2. Reduce use of `parquet::format` as much as possible, especially in publicly exposed data structures like `FileMetaData`.
3. Create a custom thrift parser that decodes directly into the structures created in step 2. Part of this task will address #7909 by correctly handling unknown union values for `LogicalType` and `ColumnOrder` (see the first sketch below). This step will also leverage the macros developed by @jhorstmann (https://github.com/jhorstmann/compact-thrift).
4. Use the parser from step 3 internally to read non-exposed structures such as the page headers.
5. Add the ability to write the new structures directly to thrift-encoded bytes (see the second sketch below).
6. Remove the `format` module.
7. Explore opportunities for further speed-ups, such as skipping row groups and projected columns, not decoding page statistics, and halting processing after reading the schema.

Hopefully I can have all of the above ready in time for 57.0.0 😅
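To illustrate the unknown-union handling mentioned in step 3: a minimal sketch of what tolerating unrecognized `LogicalType` union members could look like. The enum and function here are hypothetical simplifications, not the actual parquet-rs types; the idea is simply to map unknown field ids to a catch-all variant instead of returning an error, so metadata written by newer Parquet versions still round-trips.

```rust
/// Hypothetical, simplified LogicalType that tolerates union variants
/// added by future Parquet format versions instead of failing to parse.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    String,
    Map,
    List,
    // ... remaining known variants elided for brevity ...
    /// Field id of a union member this reader does not recognize.
    /// Preserving the id lets the metadata round-trip without loss.
    Unknown { field_id: i16 },
}

fn read_logical_type(field_id: i16) -> LogicalType {
    match field_id {
        1 => LogicalType::String,
        2 => LogicalType::Map,
        3 => LogicalType::List,
        // Ids not known to this build map to Unknown rather than Err.
        other => LogicalType::Unknown { field_id: other },
    }
}

fn main() {
    // A field id from a future Parquet version is preserved, not rejected.
    assert_eq!(read_logical_type(99), LogicalType::Unknown { field_id: 99 });
}
```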
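And for step 5, a minimal sketch of the low-level primitives a direct thrift writer needs: zigzag encoding plus ULEB128 varints, which is how the Thrift compact protocol serializes integers. Function names are placeholders, not existing parquet-rs APIs.

```rust
/// Zigzag-encode a signed integer so small magnitudes (positive or
/// negative) produce short varints, per the Thrift compact protocol.
fn zigzag_encode(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

/// Write an unsigned integer as a ULEB128 varint (7 bits per byte,
/// high bit set on all but the last byte).
fn write_varint(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

fn write_i64(v: i64, out: &mut Vec<u8>) {
    write_varint(zigzag_encode(v), out);
}

fn main() {
    let mut buf = Vec::new();
    write_i64(-1, &mut buf); // zigzag(-1) = 1, which encodes as a single 0x01 byte
    assert_eq!(buf, vec![0x01]);
}
```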
