[PR] [thrift-remodel] PoC new form for column index [arrow-rs]

via GitHub Wed, 20 Aug 2025 13:49:50 -0700


etseidl opened a new pull request, #8191:
URL: https://github.com/apache/arrow-rs/pull/8191


   # Which issue does this PR close?
   **Note: this targets a feature branch, not main**
   
   - Part of #5854.
   
   # Rationale for this change
   
   Parsing the column index is _very_ slow. The largest part of the cost is 
taking the thrift structure (which is a struct of arrays) and converting it to 
an array of structs. This results in a large number of allocations when dealing 
with binary columns.
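
   To make the cost concrete, here is a minimal sketch of the struct-of-arrays 
to array-of-structs conversion described above. The types `ThriftStyleIndex` 
and `PageStats` and the function `to_array_of_structs` are hypothetical, 
simplified stand-ins and not the actual arrow-rs types. Only the field names of 
`ThriftStyleIndex` follow the Parquet Thrift `ColumnIndex` definition.

```rust
// Hypothetical, simplified stand-ins; these are NOT the arrow-rs types.

/// Struct-of-arrays layout, mirroring the fields of the Thrift `ColumnIndex`.
struct ThriftStyleIndex {
    null_pages: Vec<bool>,
    min_values: Vec<Vec<u8>>,
    max_values: Vec<Vec<u8>>,
    null_counts: Option<Vec<i64>>,
}

/// Array-of-structs element: one entry per page.
#[derive(Debug, PartialEq)]
struct PageStats {
    min: Option<Vec<u8>>,
    max: Option<Vec<u8>>,
    null_count: Option<i64>,
}

/// Convert page by page. For binary columns every non-null page clones its
/// min and max, so the number of heap allocations grows with the page count.
fn to_array_of_structs(index: &ThriftStyleIndex) -> Vec<PageStats> {
    (0..index.null_pages.len())
        .map(|i| {
            let is_null = index.null_pages[i];
            PageStats {
                min: (!is_null).then(|| index.min_values[i].clone()),
                max: (!is_null).then(|| index.max_values[i].clone()),
                null_count: index.null_counts.as_ref().map(|c| c[i]),
            }
        })
        .collect()
}
```

   In this sketch a binary column with N non-null pages performs roughly 2N 
small allocations on top of the ones made while decoding the Thrift lists.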
   
   This is an experiment in creating a new structure to hold the column index 
info that is a little friendlier to parse. It may also be easier to consume on 
the datafusion side.
   
   # What changes are included in this PR?
   A new `ColumnIndexMetaData` enum is added, along with a type-parameterized 
`NativeColumnIndex` struct.
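
   As a rough illustration of the shape such types could take, the sketch 
below keeps the per-page values in parallel vectors instead of one struct per 
page. Only the two type names come from this PR. The fields, the enum variants 
and the accessor methods are assumptions made for illustration and may not 
match the code in the branch.

```rust
/// Hypothetical layout: values stay in parallel vectors, one slot per page.
#[derive(Debug, Clone, PartialEq)]
pub struct NativeColumnIndex<T> {
    null_pages: Vec<bool>,
    min_values: Vec<T>,
    max_values: Vec<T>,
    null_counts: Option<Vec<i64>>,
}

impl<T> NativeColumnIndex<T> {
    /// Number of pages covered by this index.
    pub fn num_pages(&self) -> usize {
        self.null_pages.len()
    }

    /// Minimum value for `page`, or `None` for an all-null or unknown page.
    pub fn min_value(&self, page: usize) -> Option<&T> {
        if *self.null_pages.get(page)? {
            None
        } else {
            self.min_values.get(page)
        }
    }

    /// Maximum value for `page`, or `None` for an all-null or unknown page.
    pub fn max_value(&self, page: usize) -> Option<&T> {
        if *self.null_pages.get(page)? {
            None
        } else {
            self.max_values.get(page)
        }
    }

    /// Null count for `page`, if null counts were written at all.
    pub fn null_count(&self, page: usize) -> Option<i64> {
        self.null_counts.as_ref()?.get(page).copied()
    }
}

/// Hypothetical enum with one variant per physical type (subset shown).
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnIndexMetaData {
    None,
    Int32(NativeColumnIndex<i32>),
    Int64(NativeColumnIndex<i64>),
    Double(NativeColumnIndex<f64>),
}
```

   With a layout like this, the decoded Thrift lists can be moved into the 
struct without building an intermediate object per page.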
   
   # Are these changes tested?
   No, this is an experiment only. If this work can be honed into an acceptable 
`Index` replacement, then tests will be added at that time.
   
   # Are there any user-facing changes?
   Yes, this would be a radical change to the column indexes in 
`ParquetMetaData`.
   


