alamb commented on PR #8191:
URL: https://github.com/apache/arrow-rs/pull/8191#issuecomment-3258420774
> > I don't think we can change the location of the row group statistics (as that is part of the format).
> > What we _could_ do is change the metadata parser so that it can skip parsing some/all of the row group statistics on the initial pass, and then have an additional API to parse the statistics when we know they will actually be needed.
>
> In my mind what would happen is:
>
> 1. We create a new index type for row group stats that is stored more efficiently and can be optionally read (including reading single columns, etc., as if it were itself a parquet file).
> 2. Users can opt in by not writing row group stats and writing the new index. They could write both for backwards compatibility with readers that don't support the new index type, but that might not be as performant on the read side for readers that do support the new index. Maybe not too bad if we can avoid parsing it as @etseidl is proposing, but we'd still pay the IO price.

FWIW this seems like a good idea to me for a particular system, but it seems overly specialized for inclusion in the general arrow-rs parquet implementation.

What I think does belong in the arrow-rs implementation is enough API hooks to make implementing such a specialized format feasible. I think it would be highly interesting to have an example that shows how to "add a custom index that is more efficient for whatever use case, and then only write minimal metadata in the main footer".

That could potentially satisfy the use case of "we have files with 10000 columns and the statistics take too long to read".

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
