etseidl commented on PR #8191: URL: https://github.com/apache/arrow-rs/pull/8191#issuecomment-3228849980
> I do think this would be great (even though then the rust metadata doesn't match the physical representation in the Parquet file). If we could do the same with BloomFilters that would also resolve an outstanding wart with the API.
>
> The trick will be to have reasonable APIs to add/strip these extra index structures from the RowGroupMetadata when needed

I was thinking something along the lines of the `ParquetMetaDataBuilder` that would allow for adding the indexes to the appropriate column. Once the dust settles a bit I'll give this a try and see what the implications are as far as the `StatisticsConverter`. It should be a bit more natural to get the page stats out, but we'll see. It might also make projections a bit simpler. But the read side may be trickier if we only want a few columns for the column index...right now we pretty much read the entire page index to do fewer reads.

> What we _could_ do is change the metadata parser so that it can skip parsing some/all of the row group statistics on the initial pass, and then have an additional API to parse the statistics when we know they will actually be needed.

Yes, this sort of thing is _exactly_ why I started on this remodel...to allow selectively reading bits of the metadata we want without having to parse the entire footer every time we touch a file. I'm working on the page reading now, and _really_ want to just skip over the page stats if they exist...I don't believe they are actually used at all. Do we still even write them?

Unrelated note...@alamb and @jhorstmann were right...better to do the reads with a trait rather than my weird `TryFrom` approach. I forgot about reading the page headers, where we don't know the size up front, so I need to use a `Read` to parse them rather than `&[u8]`. This will be corrected in an upcoming PR. Still learning...

--
This is an automated message from the Apache Git Service.
