Re: [PR] [thrift-remodel] PoC new form for column index [arrow-rs]

via GitHub Wed, 27 Aug 2025 09:32:17 -0700


adriangb commented on PR #8191:
URL: https://github.com/apache/arrow-rs/pull/8191#issuecomment-3228886756


   > I don't think we can change the location of the row group statistics (as 
that is part of the format)
   > 
   > What we _could_ do is change the metadata parser so that it can skip 
parsing some/all of the row group statistics on the initial pass, and then have 
an additional API to parse the statistics when we know they will actually be 
needed.
   
   In my mind what would happen is:
   1. We create a new index type for row group stats that is stored more 
efficiently and can be optionally read (including reading single columns, etc., 
as if it were itself a parquet file).
   2. Users can opt in by not writing row group stats and writing the new 
index instead. They could write both for backwards compatibility with readers 
that don't support the new index type, but that would likely be less performant 
for readers that do support it. It may not be too bad if we can avoid parsing 
the row group stats, as @etseidl is proposing, but we'd still pay the IO price 
of reading them.
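
   To make the "skip parsing on the initial pass, decode on demand" idea from 
the quoted comment concrete, here is a minimal sketch. Everything in it is 
hypothetical: `LazyColumnMeta`, `Stats`, `encode_stats` and the 16-byte 
little-endian min/max layout are stand-ins for illustration, not the arrow-rs 
API or the Parquet Thrift encoding. It only shows the shape of the approach: 
keep the raw statistics bytes during the first pass and decode them once, the 
first time they are requested.

```rust
use std::cell::OnceCell;

/// Decoded min/max statistics for one column chunk.
/// Hypothetical type: i64 only, for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Stats {
    pub min: i64,
    pub max: i64,
}

/// Column chunk metadata that keeps its statistics as undecoded bytes.
/// Hypothetical type, not part of arrow-rs.
pub struct LazyColumnMeta {
    /// Cheap field that is always decoded on the initial pass.
    pub num_values: u64,
    /// Raw statistics bytes, copied but not parsed on the initial pass.
    raw_stats: Vec<u8>,
    /// Cache filled on the first call to `statistics()`.
    decoded: OnceCell<Option<Stats>>,
}

impl LazyColumnMeta {
    pub fn new(num_values: u64, raw_stats: Vec<u8>) -> Self {
        Self {
            num_values,
            raw_stats,
            decoded: OnceCell::new(),
        }
    }

    /// True once `statistics()` has been called at least once.
    pub fn is_decoded(&self) -> bool {
        self.decoded.get().is_some()
    }

    /// Decodes the statistics on first use and returns the cached value
    /// afterwards. Returns `None` if the raw bytes are malformed.
    pub fn statistics(&self) -> Option<&Stats> {
        self.decoded
            .get_or_init(|| decode_stats(&self.raw_stats))
            .as_ref()
    }
}

/// Assumed toy layout: 8 bytes little-endian min, then 8 bytes
/// little-endian max. The real format uses Thrift, not this layout.
fn decode_stats(raw: &[u8]) -> Option<Stats> {
    if raw.len() != 16 {
        return None;
    }
    let mut min = [0u8; 8];
    let mut max = [0u8; 8];
    min.copy_from_slice(&raw[..8]);
    max.copy_from_slice(&raw[8..]);
    Some(Stats {
        min: i64::from_le_bytes(min),
        max: i64::from_le_bytes(max),
    })
}

/// Encoder for the same toy layout, used to build example input.
pub fn encode_stats(min: i64, max: i64) -> Vec<u8> {
    let mut out = Vec::with_capacity(16);
    out.extend_from_slice(&min.to_le_bytes());
    out.extend_from_slice(&max.to_le_bytes());
    out
}
```

   With this shape, a reader that never calls `statistics()` pays no decode 
cost, but it still holds (and had to read) the raw bytes. That is the IO price 
mentioned in point 2, which a separately stored, optionally read index would 
avoid.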


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
