alamb commented on PR #8191: URL: https://github.com/apache/arrow-rs/pull/8191#issuecomment-3228387315
> And while we're breaking things, do we actually like the column and page indexes as `Vec<Vec<index>>`? I do not like this structure and find it mind bending any time I have to deal with it

> I'm wondering if tacking them individually onto the ColumnChunkMetaData as `Option<index>` would be more intuitive?

I do think this would be great (even though the Rust metadata would then no longer match the physical representation in the Parquet file). If we could do the same with BloomFilters, that would also resolve an outstanding wart in the API.

The trick will be to have reasonable APIs to add/strip these extra index structures from the RowGroupMetadata when needed.

> This is partly something DataFusion could do, partly up to Parquet itself. In my mind one of the biggest mistakes / bottlenecks is trying to load row group stats as part of the thrift data.

> IMO the footer / thrift data should be basic metadata (the schema) + data offsets + index type and index offsets. Then the index types / offsets are used to load page indexes, bloom filters, row group zone maps and future types of indexes. But that's a very ambitious goal that I don't have the bandwidth to push for atm.

I don't think we can change the location of the row group statistics (as that is part of the format).

What we *could* do is change the metadata parser so that it can skip parsing some/all of the row group statistics on the initial pass, and then have an additional API to parse the statistics when we know they will actually be needed. This is my understanding of what @adrian-thurston suggested in this DataFusion issue:
- https://github.com/apache/datafusion/issues/16200
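
To make the two ideas above concrete (per-column `Option<index>` fields that can be added or stripped, and statistics that are only decoded when first requested), here is a rough sketch. Everything in it is hypothetical: `ColumnChunkMeta`, `Statistics`, `ColumnIndex`, `OffsetIndex`, `strip_page_indexes`, and the toy 16-byte statistics encoding are stand-ins for illustration, not the existing `parquet` crate types or the thrift encoding used by the format.

```rust
use std::cell::OnceCell;

// Hypothetical stand-ins, NOT the real `parquet` crate types.
#[derive(Debug, Clone, PartialEq)]
pub struct Statistics {
    pub min: i64,
    pub max: i64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ColumnIndex {
    pub page_mins: Vec<i64>,
    pub page_maxes: Vec<i64>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct OffsetIndex {
    pub page_offsets: Vec<u64>,
}

/// Hypothetical per-column metadata: indexes live on the column itself
/// instead of in a separate row-group-by-column `Vec<Vec<_>>`.
pub struct ColumnChunkMeta {
    pub data_offset: u64,
    /// Still-encoded statistics bytes retained from the footer.
    raw_stats: Option<Vec<u8>>,
    /// Filled on first call to `statistics()`. `OnceCell` is not `Sync`;
    /// shared metadata would need `std::sync::OnceLock` instead.
    decoded_stats: OnceCell<Option<Statistics>>,
    pub column_index: Option<ColumnIndex>,
    pub offset_index: Option<OffsetIndex>,
}

impl ColumnChunkMeta {
    /// Initial pass: record offsets and raw bytes, decode nothing.
    pub fn new(data_offset: u64, raw_stats: Option<Vec<u8>>) -> Self {
        Self {
            data_offset,
            raw_stats,
            decoded_stats: OnceCell::new(),
            column_index: None,
            offset_index: None,
        }
    }

    /// Decode the statistics on first use and cache the result.
    pub fn statistics(&self) -> Option<&Statistics> {
        self.decoded_stats
            .get_or_init(|| self.raw_stats.as_deref().and_then(decode_stats))
            .as_ref()
    }

    /// True once `statistics()` has run at least once.
    pub fn stats_decoded(&self) -> bool {
        self.decoded_stats.get().is_some()
    }

    /// Remove both page indexes, returning them to the caller.
    pub fn strip_page_indexes(&mut self) -> (Option<ColumnIndex>, Option<OffsetIndex>) {
        (self.column_index.take(), self.offset_index.take())
    }
}

/// Read a little-endian i64 at `start`, or `None` if out of bounds.
fn read_i64(bytes: &[u8], start: usize) -> Option<i64> {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes.get(start..start + 8)?);
    Some(i64::from_le_bytes(buf))
}

/// Toy decoder (assumption): min then max, each 8 bytes little-endian.
/// The real format stores statistics as thrift structs.
fn decode_stats(bytes: &[u8]) -> Option<Statistics> {
    Some(Statistics {
        min: read_i64(bytes, 0)?,
        max: read_i64(bytes, 8)?,
    })
}
```

In this shape, a reader that never prunes on statistics never pays to decode them, and a caller that wants a smaller cached footprint can call `strip_page_indexes()` on each column without touching a parallel nested vector.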
