Re: [PR] [thrift-remodel] PoC new form for column index [arrow-rs]

via GitHub Sun, 07 Sep 2025 07:39:49 -0700


alamb commented on PR #8191:
URL: https://github.com/apache/arrow-rs/pull/8191#issuecomment-3258420774


   > > I don't think we can change the location of the row group statistics (as 
that is part of the format)
   > > What we _could_ do is change the metadata parser so that it can skip 
parsing some/all of the row group statistics on the initial pass, and then have 
an additional API to parse the statistics when we know they will actually be 
needed.
   > 
   > In my mind what would happen is:
   > 
   > 1. We create a new index type for row group stats that is stored more 
efficiently and can be optionally read (including reading single columns, etc., 
as if it were itself a parquet file).
   > 2. Users can opt in by not writing row group stats and writing the new 
index. They could write both for backwards compatibility with readers that 
don't support the new index type, but that might not be as performant on the 
read side for readers that do support the new index. Maybe not too bad if we 
can avoid parsing it as @etseidl is proposing but we'd still pay the IO price.
   
   FWIW this seems like a good idea for a particular system, but it seems 
overly specialized for inclusion in the general arrow-rs parquet implementation.
   
   What I think does belong in the arrow-rs implementation is enough API hooks 
to make implementing such a specialized format feasible.
   
   I think it would be highly interesting to have an example that shows how to 
"add a custom index that is more efficient for a given use case, and then only 
write minimal metadata in the main footer".
   
   That could potentially satisfy the use case of "we have files with 10000 
columns and the statistics take too long to read".
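
To make the "skip the statistics on the initial pass, decode them only when needed" idea from the quoted suggestion concrete, here is a minimal sketch in plain Rust with no dependency on the parquet crate. Every name in it (`ColumnStats`, `LazyColumnMeta`, `encode_stats`, `decode_stats`) and the 16-byte toy encoding are hypothetical and only for illustration; this is not the arrow-rs API and not the Thrift layout used by the format.

```rust
use std::cell::OnceCell;

/// Hypothetical decoded statistics for one column chunk.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnStats {
    pub min: i64,
    pub max: i64,
}

/// Hypothetical column chunk metadata: the encoded statistics are kept
/// as raw bytes on the initial pass and decoded on first access only.
pub struct LazyColumnMeta {
    raw_stats: Vec<u8>,
    decoded: OnceCell<ColumnStats>,
}

impl LazyColumnMeta {
    /// Initial pass: store the bytes, do no decoding work.
    pub fn new(raw_stats: Vec<u8>) -> Self {
        Self { raw_stats, decoded: OnceCell::new() }
    }

    /// True once the statistics have been decoded.
    pub fn is_decoded(&self) -> bool {
        self.decoded.get().is_some()
    }

    /// Second-stage API: decode on demand and cache the result.
    pub fn stats(&self) -> &ColumnStats {
        self.decoded.get_or_init(|| decode_stats(&self.raw_stats))
    }
}

/// Toy encoding (an assumption, not the parquet format):
/// two little-endian i64 values, min followed by max.
pub fn encode_stats(min: i64, max: i64) -> Vec<u8> {
    let mut out = Vec::with_capacity(16);
    out.extend_from_slice(&min.to_le_bytes());
    out.extend_from_slice(&max.to_le_bytes());
    out
}

/// Read one little-endian i64 from an 8-byte slice.
fn read_i64_le(bytes: &[u8]) -> i64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    i64::from_le_bytes(buf)
}

/// Decode the toy encoding; panics if fewer than 16 bytes are given.
pub fn decode_stats(raw: &[u8]) -> ColumnStats {
    ColumnStats {
        min: read_i64_le(&raw[0..8]),
        max: read_i64_le(&raw[8..16]),
    }
}
```

With a layout like this, a reader of a 10000-column file would pay the decode cost only for the columns whose `stats()` it actually calls. The IO cost of reading the raw bytes remains, which is the limitation noted in the quoted discussion.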
   
   


