etseidl commented on issue #7909: URL: https://github.com/apache/arrow-rs/issues/7909#issuecomment-3137639506
Quick follow-up. I've gone down the rabbit hole of a custom implementation. So far I've found that the thrift code in `TCompactSliceInputProtocol` is pretty good 😄, but by repeating essentially what @jhorstmann and @tustvold had previously done (streamline some code, avoid string allocations, etc.), I once again got to the point of over a 2X improvement over using the thrift-generated `read_from_in_protocol`.

I'm now taking that a step further to go directly from bytes to parquet-rs structures (see https://github.com/apache/arrow-rs/issues/5854#issuecomment-2175774452). Right now all I have implemented is producing the `Arc<Type>` schema directly, rather than producing an array of `SchemaElement`s and then post-processing.

By way of benchmarking, I grab the bytes for the footer from `alltypes_tiny_pages.parquet` from parquet-testing and parse that a million times. Results on my old Mac laptop are:

- Full decode to `ParquetMetaData` (no column index): 52s
- Full read of `format::FileMetaData`: 30s
- Full read of hand-rolled `FileMetaData`: 13s
- Read of `[format::SchemaElement]` and conversion to `Arc<Type>`: 9s
- Hand-coded read from bytes to `Arc<Type>`: 6s
- Time to fully skip metadata with existing parser: 13s
- Time to fully skip metadata with new parser: 5.7s

As an aside, there's a bug in the thrift implementation of `skip`: byte arrays are all assumed to be strings, so trying to skip min/max statistics throws a non-UTF8 error.

I hope to be able to tackle the row group metadata next week. There should be a lot to gain there, as even the new parser spends a considerable amount of time allocating memory for `Vec`s. Hopefully we can avoid double allocations (currently once for the thrift structs, once for the rust structs).

Given a custom parser, we could then do interesting things like only read the schema initially, then on a subsequent call skip the schema and go right to the row group meta.
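
To make the skipping idea concrete, here is a minimal sketch of a skip routine over a byte slice. It is not the code from my branch and not the `parquet` crate API: `Skipper`, `SkipError`, and `MAX_DEPTH` are made-up names, and the type ids are the ones from the Thrift compact protocol spec. The point is that binary fields are skipped by length only, with no UTF-8 validation, so min/max statistics can't trip it up, and nothing is allocated.

```rust
/// Errors for this sketch (hypothetical names, not part of any crate).
#[derive(Debug, PartialEq)]
pub enum SkipError {
    UnexpectedEof,
    InvalidType(u8),
    VarintTooLong,
    TooDeep,
}

/// Arbitrary nesting limit (an assumption) to bound recursion on bad input.
const MAX_DEPTH: usize = 64;

/// Cursor over a borrowed footer buffer; nothing is copied or allocated.
pub struct Skipper<'a> {
    buf: &'a [u8],
}

impl<'a> Skipper<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        Self { buf }
    }

    /// Bytes not yet consumed.
    pub fn remaining(&self) -> usize {
        self.buf.len()
    }

    fn byte(&mut self) -> Result<u8, SkipError> {
        let (&b, rest) = self.buf.split_first().ok_or(SkipError::UnexpectedEof)?;
        self.buf = rest;
        Ok(b)
    }

    fn advance(&mut self, n: u64) -> Result<(), SkipError> {
        if n > self.buf.len() as u64 {
            return Err(SkipError::UnexpectedEof);
        }
        let n = n as usize;
        self.buf = &self.buf[n..];
        Ok(())
    }

    /// Unsigned LEB128 varint. Zigzag decoding is unnecessary when skipping.
    fn varint(&mut self) -> Result<u64, SkipError> {
        let mut v = 0u64;
        for shift in (0..64).step_by(7) {
            let b = self.byte()?;
            v |= u64::from(b & 0x7f) << shift;
            if b & 0x80 == 0 {
                return Ok(v);
            }
        }
        Err(SkipError::VarintTooLong)
    }

    fn skip_value(&mut self, ty: u8, in_container: bool, depth: usize) -> Result<(), SkipError> {
        if depth > MAX_DEPTH {
            return Err(SkipError::TooDeep);
        }
        match ty {
            // Struct-field bools live in the field header; container bools take one byte.
            1 | 2 => if in_container { self.advance(1) } else { Ok(()) },
            3 => self.advance(1),                   // i8
            4 | 5 | 6 => self.varint().map(|_| ()), // i16 / i32 / i64
            7 => self.advance(8),                   // double
            8 => {
                // binary or string: skip by length, never validate UTF-8
                let len = self.varint()?;
                self.advance(len)
            }
            9 | 10 => {
                // list / set: size in the high nibble (15 means a varint follows)
                let header = self.byte()?;
                let mut len = u64::from(header >> 4);
                if len == 15 {
                    len = self.varint()?;
                }
                for _ in 0..len {
                    self.skip_value(header & 0x0f, true, depth + 1)?;
                }
                Ok(())
            }
            11 => {
                // map: varint size, then one byte holding key and value types
                let len = self.varint()?;
                if len == 0 {
                    return Ok(());
                }
                let kv = self.byte()?;
                for _ in 0..len {
                    self.skip_value(kv >> 4, true, depth + 1)?;
                    self.skip_value(kv & 0x0f, true, depth + 1)?;
                }
                Ok(())
            }
            12 => self.skip_fields(depth + 1), // nested struct
            other => Err(SkipError::InvalidType(other)),
        }
    }

    fn skip_fields(&mut self, depth: usize) -> Result<(), SkipError> {
        loop {
            let header = self.byte()?;
            if header == 0 {
                return Ok(()); // STOP byte ends the struct
            }
            if header >> 4 == 0 {
                self.varint()?; // zero delta: explicit field id follows
            }
            self.skip_value(header & 0x0f, false, depth)?;
        }
    }

    /// Skip one complete struct, leaving the cursor just past its STOP byte.
    pub fn skip_struct(&mut self) -> Result<(), SkipError> {
        self.skip_fields(0)
    }
}
```

With something along these lines, a reader could call `skip_struct` on a row group or column chunk it has already pruned and pick up parsing at the next one.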
We could use pruning info to avoid parsing entire row groups, instead skipping over them, which is considerably faster. The same goes for individual columns.

As for a road map, I'm finding while doing this exercise that the mixing of structures in the `format` and `basic` modules is not ideal. I think first removing any use of `format` within the crate will help with swapping out thrift parsers down the road.