adriangb opened a new pull request, #22829:
URL: https://github.com/apache/datafusion/pull/22829
## Which issue does this PR close?
Part of the wide-schema parquet read performance work in #21968.
## Rationale for this change
Scanning parquet datasets with very wide schemas (hundreds/thousands of
columns) pays per-file CPU costs that scale with schema width even when a
query touches only a handful of columns. Two of those costs are pure
DataFusion-side overhead with no dependency on arrow-rs, so they can land
independently of the larger arrow-metadata caching work tracked in #21968
/ #21987:
1. `DefaultFilesMetadataCache` recomputed `FileMetadata::memory_size()` —
which walks the entire metadata structure — on every put, eviction,
and remove. For wide files the metadata is large, so this structural
walk on the cache hot path is significant.
2. `apply_file_schema_type_coercions` always built a `HashMap` of every
table field up front, even on the common path where no view/string
coercion is needed and the map is immediately discarded.
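To make the first cost concrete, here is a minimal, std-only sketch of a cache that records each entry's size once at insertion. It is not the DataFusion code: `Metadata`, `SizedEntry`, and `SizedCache` are hypothetical stand-ins, the size calculation is a toy, and the eviction policy is omitted. The `walks` counter exists only to show how often the expensive structural walk runs.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical stand-in for file metadata whose size is costly to compute.
pub struct Metadata {
    column_names: Vec<String>,
    /// Counts how many times `memory_size` walked the structure.
    walks: Cell<usize>,
}

impl Metadata {
    pub fn new(columns: usize) -> Self {
        Self {
            column_names: (0..columns).map(|i| format!("col_{i}")).collect(),
            walks: Cell::new(0),
        }
    }

    /// Visits every column, so the cost grows with schema width.
    pub fn memory_size(&self) -> usize {
        self.walks.set(self.walks.get() + 1);
        self.column_names.iter().map(|n| n.len()).sum()
    }

    pub fn walks(&self) -> usize {
        self.walks.get()
    }
}

/// Hypothetical entry type: the value plus its size, computed once.
struct SizedEntry {
    value: Rc<Metadata>,
    size: usize,
}

/// Toy cache that tracks total memory use without re-walking entries.
pub struct SizedCache {
    entries: HashMap<String, SizedEntry>,
    used: usize,
}

impl SizedCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new(), used: 0 }
    }

    pub fn put(&mut self, key: &str, value: Rc<Metadata>) {
        // The only structural walk for this entry happens here.
        let size = value.memory_size();
        let old = self.entries.insert(key.to_string(), SizedEntry { value, size });
        if let Some(old) = old {
            // Replacing an entry: subtract the stored size, no walk needed.
            self.used -= old.size;
        }
        self.used += size;
    }

    pub fn remove(&mut self, key: &str) -> Option<Rc<Metadata>> {
        let entry = self.entries.remove(key)?;
        // Accounting uses the stored size instead of calling memory_size().
        self.used -= entry.size;
        Some(entry.value)
    }

    pub fn used(&self) -> usize {
        self.used
    }
}
```

In this sketch, inserting and then removing an entry calls `memory_size()` exactly once; without the stored size, each bookkeeping step would repeat the walk.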
## What changes are included in this PR?
Two small, independent commits:
- **Cache entry `memory_size` in `DefaultFilesMetadataCache`.** Store each
entry's size alongside it (`SizedCacheEntry`), computed once at
insertion, so put/evict/remove no longer re-walk the metadata.
- **Skip the coercion lookup map when no coercion is needed.** Do a cheap
flag-only first pass over the table fields and only build the
name→type `HashMap` when a transformation is actually required.
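The flag-only first pass can be sketched roughly as follows. This is a simplified, std-only illustration, not the actual `apply_file_schema_type_coercions` implementation: the `Ty` and `Field` types, the `coerce_file_schema` function, and the two coercion rules shown are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical, reduced set of data types for the example.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Ty {
    Int64,
    Utf8,
    Utf8View,
    Binary,
    BinaryView,
}

/// Hypothetical field: a name and a type.
#[derive(Clone, Debug, PartialEq)]
pub struct Field {
    pub name: String,
    pub ty: Ty,
}

fn is_view(ty: Ty) -> bool {
    matches!(ty, Ty::Utf8View | Ty::BinaryView)
}

/// Assumed coercion rules: a file column is widened to the table's view type.
fn coerced(file_ty: Ty, table_ty: Ty) -> Option<Ty> {
    match (file_ty, table_ty) {
        (Ty::Utf8, Ty::Utf8View) => Some(Ty::Utf8View),
        (Ty::Binary, Ty::BinaryView) => Some(Ty::BinaryView),
        _ => None,
    }
}

/// Returns `None` when the file schema can be used as-is, or
/// `Some(new_schema)` when at least one field was coerced.
pub fn coerce_file_schema(table: &[Field], file: &[Field]) -> Option<Vec<Field>> {
    // Cheap first pass: only checks a flag per field, allocates nothing.
    if !table.iter().any(|f| is_view(f.ty)) {
        return None;
    }

    // Slow path: build the name -> type map only now that it is needed.
    let by_name: HashMap<&str, Ty> =
        table.iter().map(|f| (f.name.as_str(), f.ty)).collect();

    let mut changed = false;
    let mut out = Vec::with_capacity(file.len());
    for f in file {
        let target = by_name
            .get(f.name.as_str())
            .and_then(|&table_ty| coerced(f.ty, table_ty));
        match target {
            Some(ty) => {
                changed = true;
                out.push(Field { name: f.name.clone(), ty });
            }
            None => out.push(f.clone()),
        }
    }
    changed.then_some(out)
}
```

With a table schema that contains no view types, the function returns after a single scan over the table fields and never allocates the map, which is the common path described above.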
## Are these changes tested?
Covered by existing tests — `datafusion-execution` cache tests and the
`schema_coercion` tests in `datafusion-datasource-parquet` pass. The
changes preserve existing behavior; they only remove redundant work.
## Are there any user-facing changes?
No. `SizedCacheEntry` is an internal cache type; no public API changes.