Re: [I] [Parquet] reader appears to have bug when supporting unknown sort orders [arrow-rs]

via GitHub Wed, 30 Jul 2025 12:56:27 -0700


etseidl commented on issue #7909:
URL: https://github.com/apache/arrow-rs/issues/7909#issuecomment-3137639506


   Quick follow-up. I've gone down the rabbit hole of a custom implementation. 
So far I've found that the thrift code in `TCompactSliceInputProtocol` is 
pretty good 😄, but by repeating essentially what @jhorstmann and @tustvold had 
previously done (streamlining some code, avoiding string allocations, etc.), I 
once again got to over a 2X improvement over the thrift-generated 
`read_from_in_protocol`. I'm now taking that a step further and going directly 
from bytes to parquet-rs structures (see 
https://github.com/apache/arrow-rs/issues/5854#issuecomment-2175774452). Right 
now all I have implemented is producing the `Arc<Type>` schema directly, rather 
than producing an array of `SchemaElement`s and then post-processing it. As a 
benchmark, I grab the footer bytes of `alltypes_tiny_pages.parquet` from 
parquet-testing and parse them a million times. Results on my old Mac laptop 
are:
   
   Full decode to `ParquetMetaData` (no column index): 52s
   Full read of `format::FileMetaData`: 30s
   Full read of hand-rolled `FileMetaData`: 13s
   Read of `[format::SchemaElement]` and conversion to `Arc<Type>`: 9s
   Hand-coded read from bytes to `Arc<Type>`: 6s
   Time to fully skip metadata with existing parser: 13s
   Time to fully skip metadata with new parser: 5.7s
   
   As an aside, there's a bug in the thrift implementation of `skip`: byte 
arrays are all assumed to be strings, so trying to skip min/max statistics 
fails with a non-UTF8 error.
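
   To illustrate the failure mode, here is a minimal sketch that assumes 
only the Thrift compact-protocol framing for binary/string fields (an 
unsigned varint length followed by that many bytes). It is not the thrift 
crate's code; `read_uvarint`, `take_binary`, `skip_as_string` and 
`skip_as_bytes` are hypothetical names made up for this example.

```rust
/// Read an unsigned LEB128 varint from the front of `buf`, advancing it.
fn read_uvarint(buf: &mut &[u8]) -> Result<u64, String> {
    let mut value = 0u64;
    for shift in (0..64u32).step_by(7) {
        let data = *buf;
        let (&byte, rest) = data.split_first().ok_or("unexpected end of input")?;
        *buf = rest;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Ok(value);
        }
    }
    Err("varint too long".to_string())
}

/// Split off the payload of a binary/string field: varint length, then raw bytes.
fn take_binary<'a>(buf: &mut &'a [u8]) -> Result<&'a [u8], String> {
    let len = usize::try_from(read_uvarint(buf)?).map_err(|e| e.to_string())?;
    let data: &'a [u8] = *buf;
    if len > data.len() {
        return Err("binary field runs past end of input".to_string());
    }
    let (payload, rest) = data.split_at(len);
    *buf = rest;
    Ok(payload)
}

/// Skip that treats every binary field as a string (the failure mode described above).
fn skip_as_string(buf: &mut &[u8]) -> Result<(), String> {
    let payload = take_binary(buf)?;
    std::str::from_utf8(payload).map_err(|e| format!("non-UTF8 payload: {e}"))?;
    Ok(())
}

/// Skip that only advances past the bytes; the contents are never inspected.
fn skip_as_bytes(buf: &mut &[u8]) -> Result<(), String> {
    take_binary(buf).map(|_| ())
}
```

   Given a field whose payload is not valid UTF-8 (as min/max statistics 
often are), `skip_as_string` returns an error while `skip_as_bytes` simply 
moves the cursor past it.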
   
   I hope to be able to tackle the row group metadata next week. There should 
be a lot to gain there, as even the new parser spends a considerable amount of 
time allocating memory for `Vec`s. Hopefully we can avoid double allocations 
(currently once for thrift structs, once for Rust structs). Given a custom 
parser, we could then do interesting things like read only the schema 
initially, then on a subsequent call skip the schema and go right to the row 
group metadata. We could use pruning info to avoid parsing entire row groups, 
skipping over them instead, which is considerably faster. The same goes for 
individual columns.
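
   The "skip ahead to the field you want" idea can be sketched against the 
Thrift compact protocol directly. What follows is an illustration under 
stated assumptions, not the parser described in this comment: all function 
names are hypothetical, maps are left out for brevity, and the nesting limit 
of 64 is arbitrary. In parquet.thrift, `schema` is field 2 and `row_groups` 
is field 4 of `FileMetaData`, so asking for field 4 would step over the 
schema without building anything.

```rust
type Res<T> = Result<T, String>;

/// Take `n` bytes off the front of `buf`, advancing it.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Res<&'a [u8]> {
    let data: &'a [u8] = *buf;
    if n > data.len() {
        return Err("unexpected end of input".to_string());
    }
    let (head, tail) = data.split_at(n);
    *buf = tail;
    Ok(head)
}

/// Read an unsigned LEB128 varint. Zigzag ints share this framing, so it is
/// also enough to skip i16/i32/i64 values.
fn read_uvarint(buf: &mut &[u8]) -> Res<u64> {
    let mut value = 0u64;
    for shift in (0..64u32).step_by(7) {
        let byte = take(buf, 1)?[0];
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Ok(value);
        }
    }
    Err("varint too long".to_string())
}

fn read_len(buf: &mut &[u8]) -> Res<usize> {
    usize::try_from(read_uvarint(buf)?).map_err(|e| e.to_string())
}

/// Read a field header: `None` at the STOP byte, else (field id, type nibble).
fn read_field_header(buf: &mut &[u8], last_id: &mut i16) -> Res<Option<(i16, u8)>> {
    let header = take(buf, 1)?[0];
    if header == 0 {
        return Ok(None);
    }
    let delta = i16::from(header >> 4);
    *last_id = if delta == 0 {
        let z = read_uvarint(buf)?; // long form: zigzag-encoded field id
        ((z >> 1) as i64 ^ -((z & 1) as i64)) as i16
    } else {
        last_id.wrapping_add(delta)
    };
    Ok(Some((*last_id, header & 0x0f)))
}

/// Skip one value of compact-protocol type `ty` without materializing it.
fn skip_value(buf: &mut &[u8], ty: u8, in_collection: bool, depth: u32) -> Res<()> {
    if depth == 0 {
        return Err("nesting too deep".to_string());
    }
    match ty {
        // A bool struct field lives in the header; inside a collection it is one byte.
        1 | 2 => take(buf, usize::from(in_collection)).map(|_| ()),
        3 => take(buf, 1).map(|_| ()),              // i8
        4 | 5 | 6 => read_uvarint(buf).map(|_| ()), // i16, i32, i64
        7 => take(buf, 8).map(|_| ()),              // double
        8 => {
            let len = read_len(buf)?; // binary or string: never validated as UTF-8
            take(buf, len).map(|_| ())
        }
        9 | 10 => {
            // list or set: size in the high nibble, 15 means a varint size follows
            let header = take(buf, 1)?[0];
            let mut n = usize::from(header >> 4);
            if n == 15 {
                n = read_len(buf)?;
            }
            for _ in 0..n {
                skip_value(buf, header & 0x0f, true, depth - 1)?;
            }
            Ok(())
        }
        12 => {
            let mut last_id = 0i16;
            while let Some((_, field_ty)) = read_field_header(buf, &mut last_id)? {
                skip_value(buf, field_ty, false, depth - 1)?;
            }
            Ok(())
        }
        other => Err(format!("unsupported type id {other} (maps omitted in this sketch)")),
    }
}

/// Walk a struct's top-level fields, skipping everything before field `wanted`.
/// Returns the bytes starting at that field's value, or `None` if absent.
fn find_field(mut buf: &[u8], wanted: i16) -> Res<Option<&[u8]>> {
    let mut last_id = 0i16;
    while let Some((id, ty)) = read_field_header(&mut buf, &mut last_id)? {
        if id == wanted {
            return Ok(Some(buf));
        }
        skip_value(&mut buf, ty, false, 64)?;
    }
    Ok(None)
}
```

   The same `skip_value` call could be applied per list element, which is 
what would let a reader step over a pruned row group or column chunk while 
decoding only its neighbours.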
   
   As for a road map, I'm finding while doing this exercise that the mixing 
of structures from the `format` and `basic` modules is not ideal. I think first 
removing any use of `format` within the crate will help with swapping out 
thrift parsers down the road.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
