[PR] Alamb/refactor push decoder [arrow-rs]

via GitHub Fri, 12 Sep 2025 12:48:15 -0700


alamb opened a new pull request, #8340:
URL: https://github.com/apache/arrow-rs/pull/8340


   # Which issue does this PR close?
   
   - part of #8000 
   - Follow on to https://github.com/apache/arrow-rs/pull/8080
   
   # Rationale for this change
   
   The current ParquetMetadataDecoder intermixes three things:
   1. The state machine for decoding Parquet metadata (footer, then metadata, then (optional) indexes)
   2. Orchestrating IO (i.e. calling read, etc.)
   3. Decoding thrift-encoded bytes into objects
   
   This makes it almost impossible to add features like "only decode a subset of the columns in the ColumnIndex" and other potentially advanced use cases.
   
   Now that we have a "push" style API for metadata decoding that avoids IO, the next step is to move the actual decoding work into this API so that the existing ParquetMetadataDecoder just calls into the PushDecoder.
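
To illustrate the separation described above, here is a minimal sketch of a push-style decoder. It is not the arrow-rs API: `SketchPushDecoder`, `Step`, `State`, and `read_metadata_bytes` are hypothetical names invented for this example. The decoder never performs IO itself. It only tells the caller which byte range it needs next, and the caller pushes those bytes in. The sketch relies only on the standard Parquet footer layout (a 4-byte little-endian metadata length followed by the magic `PAR1`), stops at the raw metadata bytes, and omits both thrift decoding and the optional indexes.

```rust
use std::ops::Range;

/// Parquet footer: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_LEN: u64 = 8;
const MAGIC: &[u8] = b"PAR1";

/// What the (hypothetical) decoder reports after each push.
#[derive(Debug, PartialEq)]
enum Step {
    /// The caller should fetch this byte range of the file and push it.
    NeedBytes(Range<u64>),
    /// The raw, still thrift-encoded, metadata bytes.
    Done(Vec<u8>),
}

/// The decoding state machine: footer first, then metadata.
#[derive(Debug)]
enum State {
    Footer,
    Metadata { len: u64 },
    Finished,
}

/// Hypothetical push decoder: holds state only, never does IO.
struct SketchPushDecoder {
    file_len: u64,
    state: State,
}

impl SketchPushDecoder {
    fn new(file_len: u64) -> Result<Self, String> {
        if file_len < FOOTER_LEN {
            return Err("file too small to contain a footer".to_string());
        }
        Ok(Self { file_len, state: State::Footer })
    }

    /// The first range the caller must fetch: the 8-byte footer.
    fn first_request(&self) -> Range<u64> {
        self.file_len - FOOTER_LEN..self.file_len
    }

    /// Accept the bytes for the previously requested range and advance.
    fn push(&mut self, bytes: &[u8]) -> Result<Step, String> {
        match self.state {
            State::Footer => {
                if bytes.len() != 8 || &bytes[4..8] != MAGIC {
                    return Err("invalid footer".to_string());
                }
                let len =
                    u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as u64;
                let end = self.file_len - FOOTER_LEN;
                if len > end {
                    return Err("metadata length exceeds file size".to_string());
                }
                self.state = State::Metadata { len };
                Ok(Step::NeedBytes(end - len..end))
            }
            State::Metadata { len } => {
                if bytes.len() as u64 != len {
                    return Err("unexpected number of metadata bytes".to_string());
                }
                self.state = State::Finished;
                Ok(Step::Done(bytes.to_vec()))
            }
            State::Finished => Err("decoder already finished".to_string()),
        }
    }
}

/// Caller-side IO loop. Here "IO" is just slicing an in-memory buffer;
/// a real caller could read from a file or an object store instead.
fn read_metadata_bytes(file: &[u8]) -> Result<Vec<u8>, String> {
    let mut decoder = SketchPushDecoder::new(file.len() as u64)?;
    let mut request = decoder.first_request();
    loop {
        let bytes = &file[request.start as usize..request.end as usize];
        match decoder.push(bytes)? {
            Step::NeedBytes(next) => request = next,
            Step::Done(metadata) => return Ok(metadata),
        }
    }
}
```

In this arrangement an IO-driving wrapper such as `read_metadata_bytes` stays a thin loop. A feature like decoding only some columns would then only need to change the state machine and the byte-to-object step, not the IO code.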
   
   # What changes are included in this PR?
   
   1. Extract the decoding state machine into PushMetadataDecoder
   2. Update ParquetMetadataDecoder to use the PushMetadataDecoder
   3. Extract the bytes --> object code into its own module
   
   This almost certainly will conflict with @etseidl 's plans in 
thrift-remodel. 
   
   # Are these changes tested?
   
   By existing tests.
   
   # Are there any user-facing changes?
   
   Not really -- this is an internal change that will make it easier to add features such as "only decode a subset of the columns in the ColumnIndex".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
