etseidl commented on issue #8643: URL: https://github.com/apache/arrow-rs/issues/8643#issuecomment-3475799336
> > I don't quite understand this proposal
> >
> > I don't know should we separate to:
> >
> > 1. Decode all row-groups with limited columns (or a selected set of row-groups, if compute engine has row-groups info in metadata), maybe only decode the filter columns min-max stats?
> > 2. If we have all row-groups with limited columns, prune row-groups by statistics
> > 3. Then, decode the Column Index by filter columns and limited row-groups, and fill it to `ParquetMetadata`?

That sounds right to me. Let's say in step 1 there's a predicate on column x and a projection of columns y and z, so we decode the column metadata for x with statistics, and for y and z without statistics. In step 2 we apply the predicate on x to see if we can remove any row groups off the bat, and then in step 3 we fetch the column index for x and the offset index for x, y, and z.

But we still need to address https://github.com/apache/arrow-rs/issues/8643#issuecomment-3464051374

Maybe internally we change all the `Vec<XMetaData>` to `Vec<Option<XMetaData>>` so indexing still works, but what should we do when an invalid index is passed if we don't change the API to return `Option`? (The current APIs don't seem to do bounds checking, so they'll already panic on bad indexes.)
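To make the `Vec<Option<XMetaData>>` idea concrete, here is a minimal sketch of the three steps using only the standard library. Everything in it is hypothetical and for illustration only: `ColumnMeta`, `RowGroupMeta`, `decode`, `column`, `decoded_columns`, and `prune_eq` are made-up names, not arrow-rs APIs, and the `(i64, i64)` min/max pair stands in for real Parquet statistics.

```rust
/// Hypothetical stand-in for per-column chunk metadata (not the arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnMeta {
    /// Min/max statistics; decoded only for predicate columns in this sketch.
    pub min_max: Option<(i64, i64)>,
}

/// Hypothetical row group that keeps one slot per schema column, so the
/// original column indexes stay valid even when a column was not decoded.
#[derive(Debug, Clone, Default)]
pub struct RowGroupMeta {
    columns: Vec<Option<ColumnMeta>>,
}

impl RowGroupMeta {
    /// Step 1: decode predicate columns with statistics, projected columns
    /// without statistics, and leave every other slot as `None`.
    pub fn decode(
        raw_stats: &[(i64, i64)],
        predicate_cols: &[usize],
        projected_cols: &[usize],
    ) -> Self {
        let columns = raw_stats
            .iter()
            .enumerate()
            .map(|(i, &stats)| {
                if predicate_cols.contains(&i) {
                    Some(ColumnMeta { min_max: Some(stats) })
                } else if projected_cols.contains(&i) {
                    Some(ColumnMeta { min_max: None })
                } else {
                    None
                }
            })
            .collect();
        Self { columns }
    }

    /// Bounds-checked accessor: returns `None` for an out-of-range index and
    /// for a column that was skipped during decoding, instead of panicking.
    pub fn column(&self, i: usize) -> Option<&ColumnMeta> {
        self.columns.get(i).and_then(|c| c.as_ref())
    }

    /// Step 3 input: the columns that were decoded, i.e. the ones whose
    /// offset index would be fetched next.
    pub fn decoded_columns(&self) -> Vec<usize> {
        self.columns
            .iter()
            .enumerate()
            .filter_map(|(i, c)| c.as_ref().map(|_| i))
            .collect()
    }
}

/// Step 2: return the indexes of row groups whose [min, max] range on `col`
/// may contain `value`. Row groups without statistics are always kept.
pub fn prune_eq(row_groups: &[RowGroupMeta], col: usize, value: i64) -> Vec<usize> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| match rg.column(col).and_then(|c| c.min_max) {
            Some((min, max)) => min <= value && value <= max,
            None => true, // no statistics available, so we cannot prune
        })
        .map(|(i, _)| i)
        .collect()
}
```

With x at index 0 and y and z at indexes 1 and 2, `RowGroupMeta::decode(&stats, &[0], &[1, 2])` fills three slots and leaves the rest as `None`. The sketch sidesteps the open question by returning `Option` from the accessor, which is exactly the API change discussed above; keeping the current non-`Option` signatures would still need a decision on what to do for a `None` slot.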
