etseidl commented on issue #8643:
URL: https://github.com/apache/arrow-rs/issues/8643#issuecomment-3475799336

   > > I don't quite understand this proposal
   > 
   > I don't know whether we should split this into:
   > 
   >     1. Decode all row groups with limited columns (or a selected set of row groups, if the compute engine has row-group info in its metadata), maybe decoding only the min-max stats of the filter columns?
   > 
   >     2. Once we have all row groups with limited columns, prune row groups by statistics.
   > 
   >     3. Then decode the Column Index for the filter columns and the remaining row groups, and fill it into `ParquetMetadata`?
   
   That sounds right to me. Let's say in step 1 there's a predicate on column x and a projection of columns y and z, so we decode column metadata for x with statistics, and for y and z without statistics. In step 2 we apply the predicate on x to see if we can remove any row groups right off the bat, and then in step 3 we fetch the column index for x and the offset index for x, y, and z.
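
To make the three steps concrete, here is a minimal sketch in plain Rust. Everything in it is hypothetical: `ColumnChunkSketch`, `RowGroupSketch`, `prune_row_groups`, `PageIndexPlan`, and `plan_page_index` are illustration-only names, not arrow-rs types or APIs, and the predicate is assumed to be a simple `x > threshold` on an integer column.

```rust
/// Hypothetical stand-in for per-column chunk metadata (not an arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnChunkSketch {
    pub column: &'static str,
    /// Min/max statistics. In step 1 these are decoded only for predicate
    /// columns; projected-only columns carry `None`.
    pub min_max: Option<(i64, i64)>,
}

/// Hypothetical row group that holds only the columns decoded in step 1.
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupSketch {
    pub ordinal: usize,
    pub columns: Vec<ColumnChunkSketch>,
}

/// Step 2: return the ordinals of row groups that may contain rows with
/// `column > threshold`. Row groups without statistics for `column` are kept,
/// since nothing can be proven about them.
pub fn prune_row_groups(
    row_groups: &[RowGroupSketch],
    column: &str,
    threshold: i64,
) -> Vec<usize> {
    row_groups
        .iter()
        .filter(|rg| {
            rg.columns
                .iter()
                .find(|c| c.column == column)
                .and_then(|c| c.min_max)
                .map_or(true, |(_, max)| max > threshold)
        })
        .map(|rg| rg.ordinal)
        .collect()
}

/// Step 3: which page-index structures to fetch, and for which row groups.
#[derive(Debug, PartialEq)]
pub struct PageIndexPlan {
    pub row_groups: Vec<usize>,
    /// Column index is only needed for predicate columns.
    pub column_index_columns: Vec<&'static str>,
    /// Offset index is needed for every column that will be read.
    pub offset_index_columns: Vec<&'static str>,
}

pub fn plan_page_index(
    surviving_row_groups: Vec<usize>,
    predicate_columns: &[&'static str],
    projected_columns: &[&'static str],
) -> PageIndexPlan {
    // Offset index covers predicate columns plus projected columns, deduplicated.
    let mut offset_index_columns = predicate_columns.to_vec();
    for c in projected_columns {
        if !offset_index_columns.contains(c) {
            offset_index_columns.push(*c);
        }
    }
    PageIndexPlan {
        row_groups: surviving_row_groups,
        column_index_columns: predicate_columns.to_vec(),
        offset_index_columns,
    }
}
```

With a predicate on x and a projection of y and z, the plan asks for the column index of x only and the offset index of x, y, and z, restricted to the row groups that survived step 2.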
   
   But we still need to address 
https://github.com/apache/arrow-rs/issues/8643#issuecomment-3464051374
   
   Maybe internally we change all the `Vec<XMetaData>` to `Vec<Option<XMetaData>>` so indexing still works. But what should happen when an invalid index is passed, if we don't change the API to return `Option`? (The current APIs don't seem to do bounds checking, so they'll already panic on bad indexes.)
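
One possible shape for this, as a rough sketch only: keep a panicking accessor alongside a checked one that returns `Option`. The names `SparseMetadataSketch`, `RowGroupMetaSketch`, `try_row_group`, and `set_row_group` are made up for illustration and are not part of the parquet crate. In this sketch an out-of-range index and a row group that was never decoded are treated the same way.

```rust
/// Hypothetical stand-in for a decoded row group's metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupMetaSketch {
    pub num_rows: i64,
}

/// Hypothetical sparse container: one slot per row group in the file,
/// `None` where the row group has not been decoded.
pub struct SparseMetadataSketch {
    row_groups: Vec<Option<RowGroupMetaSketch>>,
}

impl SparseMetadataSketch {
    /// Create a container with `n` empty slots so file ordinals stay valid.
    pub fn with_num_row_groups(n: usize) -> Self {
        Self { row_groups: vec![None; n] }
    }

    /// Fill slot `i`. Returns `false` if `i` is out of range.
    pub fn set_row_group(&mut self, i: usize, rg: RowGroupMetaSketch) -> bool {
        match self.row_groups.get_mut(i) {
            Some(slot) => {
                *slot = Some(rg);
                true
            }
            None => false,
        }
    }

    /// Checked accessor: `None` if `i` is out of range or was not decoded.
    pub fn try_row_group(&self, i: usize) -> Option<&RowGroupMetaSketch> {
        self.row_groups.get(i).and_then(|slot| slot.as_ref())
    }

    /// Panicking accessor with the existing signature shape: panics if `i`
    /// is out of range or the row group was not decoded.
    pub fn row_group(&self, i: usize) -> &RowGroupMetaSketch {
        self.try_row_group(i)
            .expect("row group index out of range or not decoded")
    }

    /// Number of row groups in the file, decoded or not.
    pub fn num_row_groups(&self) -> usize {
        self.row_groups.len()
    }
}
```

The trade-off is that the panicking accessor now has a second failure mode (valid index, missing data), which callers who never request sparse decoding would not hit.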

