etseidl commented on PR #8191:
URL: https://github.com/apache/arrow-rs/pull/8191#issuecomment-3228849980

   > I do think this would be great (even though then the Rust metadata doesn't 
match the physical representation in the Parquet file). If we could do the same 
with BloomFilters that would also resolve an outstanding wart with the API.
   
   > The trick will be to have reasonable APIs to add/strip these extra index 
structures from the RowGroupMetadata when needed
   
   I was thinking something along the lines of the `ParquetMetaDataBuilder` 
that would allow for adding the indexes to the appropriate column. Once the 
dust settles a bit I'll give this a try and see what the implications are for 
the `StatisticsConverter`. It should be a bit more natural to get the 
page stats out, but we'll see. It might also make projections a bit simpler. 
But the read side may be trickier if we only want the column index for a few 
columns...right now we pretty much read the entire page index in one go to keep 
the number of reads down.
   
   > What we _could_ do is change the metadata parser so that it can skip 
parsing some/all of the row group statistics on the initial pass, and then have 
an additional API to parse the statistics when we know they will actually be 
needed.
   
   Yes, this sort of thing is _exactly_ why I started on this remodel...to 
allow selectively reading bits of the metadata we want without having to parse 
the entire footer every time we touch a file.
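
   To make the "skip now, parse later" idea concrete, here is a minimal sketch. Everything in it is hypothetical (`LazyColumnMeta`, `DecodedStats`, and the toy `decode_stats` are not parquet crate APIs), and the stand-in decoder reads two little-endian `i64` values rather than real Thrift-encoded statistics. It only shows the shape: keep the encoded bytes on the first pass and decode them on first access.

```rust
use std::cell::OnceCell;

/// Hypothetical decoded statistics; the real parquet statistics type is richer.
#[derive(Debug, Clone, PartialEq)]
pub struct DecodedStats {
    pub min: i64,
    pub max: i64,
}

/// Hypothetical column-chunk metadata that keeps the still-encoded statistics
/// bytes from the first pass and decodes them only when first asked for.
pub struct LazyColumnMeta {
    raw_stats: Option<Vec<u8>>,
    decoded: OnceCell<Option<DecodedStats>>,
}

impl LazyColumnMeta {
    pub fn new(raw_stats: Option<Vec<u8>>) -> Self {
        Self { raw_stats, decoded: OnceCell::new() }
    }

    /// True once `statistics()` has been called at least once.
    pub fn is_decoded(&self) -> bool {
        self.decoded.get().is_some()
    }

    /// Decode on first call, then return the cached result.
    pub fn statistics(&self) -> Option<&DecodedStats> {
        self.decoded
            .get_or_init(|| self.raw_stats.as_deref().and_then(decode_stats))
            .as_ref()
    }
}

/// Placeholder decoder (assumption): min and max as two little-endian i64s.
/// Real Parquet statistics are Thrift compact-encoded.
fn decode_stats(bytes: &[u8]) -> Option<DecodedStats> {
    let mut min = [0u8; 8];
    let mut max = [0u8; 8];
    min.copy_from_slice(bytes.get(0..8)?);
    max.copy_from_slice(bytes.get(8..16)?);
    Some(DecodedStats {
        min: i64::from_le_bytes(min),
        max: i64::from_le_bytes(max),
    })
}
```

   A reader that never touches `statistics()` pays only for holding the raw bytes. Whether the real parser would keep a byte range, an offset into the footer, or something else is an open design question.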
   
   I'm working on the page reading now, and _really_ want to just skip over the 
page stats if they exist...I don't believe they are actually used at all. Do we 
still even write them?
   
   Unrelated note...@alamb and @jhorstmann were right...better to do the reads 
with a trait rather than my weird `TryFrom` approach. I forgot about reading 
the page headers, where we don't know the size up front, so I need to use a 
`Read` to parse them rather than `&[u8]`. This will be corrected in an upcoming 
PR. Still learning...
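
   For anyone following along, a rough illustration of why `Read` is the more flexible input here. The helper below is a hypothetical sketch, not code from the PR: it decodes one unsigned LEB128 varint (the integer encoding used by the Thrift compact protocol) from any `std::io::Read`. Since `&[u8]` itself implements `Read`, the same function covers both the in-memory footer and a page header whose length is not known up front.

```rust
use std::io::{self, Read};

/// Read one unsigned LEB128 varint from any `Read` source.
/// Hypothetical helper for illustration; the name is not from the parquet crate.
pub fn read_uvarint<R: Read>(reader: &mut R) -> io::Result<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let mut byte = [0u8; 1];
        // Fails with UnexpectedEof if the input ends mid-varint.
        reader.read_exact(&mut byte)?;
        value |= u64::from(byte[0] & 0x7f) << shift;
        if byte[0] & 0x80 == 0 {
            // High bit clear: this was the last byte of the varint.
            return Ok(value);
        }
        shift += 7;
        if shift >= 64 {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "varint too long",
            ));
        }
    }
}
```

   When called with a `&mut &[u8]`, the slice is advanced past the bytes consumed, so the caller can work out the encoded size after the fact instead of needing it before parsing.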

