adriangb commented on issue #21968: URL: https://github.com/apache/datafusion/issues/21968#issuecomment-4652647524
### Wide-schema perf: PR breakdown

The investigation work is now split into focused PRs:

**arrow-rs (primitives):**
- apache/arrow-rs#9882: O(1) `parquet_column`, `parquet_to_arrow_schema_and_field_levels` + `ArrowReaderMetadata::from_field_levels`, `StatisticsConverter::from_arrow_field`, `AsyncFileReader::get_arrow_reader_metadata`. Ready for review.

**DataFusion:**
- #22829 (the arrow-rs-independent wins): cache entry `memory_size` in `DefaultFilesMetadataCache`, and `apply_file_schema_type_coercions` early-return. Mergeable now.
- #22830 (draft): cache per-file `ArrowReaderMetadata` across opens. Depends on arrow-rs#9882; blocked on a build until those primitives ship in a release DataFusion bumps to (DataFusion `main` is on arrow `58.3.0`, the primitives are on arrow-rs `main`/`59.0.0`).

The original `statistics_from_parquet_metadata` O(N²)→O(N) idea is **not** included: upstream #22462 already refactored that path, and combined with arrow-rs#9882's O(1) `parquet_column` it is already O(N).

The earlier all-in-one draft #21987 is superseded by the split above.
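For readers following the O(N²)→O(N) point, here is a minimal sketch of the general pattern, assuming the quadratic cost comes from doing a linear column lookup once per column. `Leaf`, `find_linear`, and `build_index` are hypothetical names used only for illustration; this is not the arrow-rs or DataFusion code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a leaf column in a wide schema.
struct Leaf {
    name: String,
}

/// Linear scan: O(N) per lookup, so resolving all N columns costs O(N^2).
fn find_linear(leaves: &[Leaf], name: &str) -> Option<usize> {
    leaves.iter().position(|leaf| leaf.name == name)
}

/// Build a name -> index map once: O(N) up front, then O(1) expected per
/// lookup, so resolving all N columns costs O(N) overall.
fn build_index(leaves: &[Leaf]) -> HashMap<&str, usize> {
    leaves
        .iter()
        .enumerate()
        .map(|(i, leaf)| (leaf.name.as_str(), i))
        .collect()
}
```

Both functions return the same index for a given name; only the cost per lookup differs, which is what matters once a schema has thousands of columns.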
