adriangb opened a new pull request, #22829:
URL: https://github.com/apache/datafusion/pull/22829

   ## Which issue does this PR close?
   
   - Part of the wide-schema parquet read performance work in #21968.
   
   ## Rationale for this change
   
   Scanning parquet datasets with very wide schemas (hundreds/thousands of
   columns) pays per-file CPU costs that scale with schema width even when a
   query touches only a handful of columns. Two of those costs are pure
   DataFusion-side overhead with no dependency on arrow-rs, so they can land
   independently of the larger arrow-metadata caching work tracked in #21968
   / #21987:
   
   1. `DefaultFilesMetadataCache` recomputed `FileMetadata::memory_size()` —
      which walks the entire metadata structure — on every put, eviction,
      and remove. For wide files the metadata is large, so this structural
      walk on the cache hot path is significant.
   2. `apply_file_schema_type_coercions` always built a `HashMap` of every
      table field up front, even on the common path where no view/string
      coercion is needed and the map is immediately discarded.
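
To illustrate the first cost, here is a minimal sketch of a cache that measures each entry once at insertion and reuses the stored size afterwards. It is not the code in this PR: `Metadata`, `SizedEntry`, and `SizedCache` are hypothetical stand-ins for `FileMetadata`, `SizedCacheEntry`, and `DefaultFilesMetadataCache`, and the `walks` counter exists only to make the number of size computations observable.

```rust
use std::cell::Cell;
use std::collections::HashMap;

/// Hypothetical stand-in for file metadata whose size is costly to compute.
struct Metadata {
    column_names: Vec<String>,
    walks: Cell<usize>, // how many times `memory_size` has run
}

impl Metadata {
    fn new(columns: &[&str]) -> Self {
        Self {
            column_names: columns.iter().map(|c| c.to_string()).collect(),
            walks: Cell::new(0),
        }
    }

    /// Walks the whole structure; the cost grows with the number of columns.
    fn memory_size(&self) -> usize {
        self.walks.set(self.walks.get() + 1);
        self.column_names.iter().map(|c| c.len()).sum()
    }

    fn walks(&self) -> usize {
        self.walks.get()
    }
}

/// The value plus its size, measured once when the entry is created.
struct SizedEntry {
    value: Metadata,
    size: usize,
}

/// Hypothetical cache that tracks total memory from the stored sizes.
struct SizedCache {
    entries: HashMap<String, SizedEntry>,
    used: usize,
}

impl SizedCache {
    fn new() -> Self {
        Self { entries: HashMap::new(), used: 0 }
    }

    fn put(&mut self, key: &str, value: Metadata) {
        let size = value.memory_size(); // the only structural walk
        let entry = SizedEntry { value, size };
        if let Some(old) = self.entries.insert(key.to_string(), entry) {
            self.used -= old.size; // replaced entry: reuse its stored size
        }
        self.used += size;
    }

    fn remove(&mut self, key: &str) -> Option<Metadata> {
        let entry = self.entries.remove(key)?;
        self.used -= entry.size; // no re-walk on removal
        Some(entry.value)
    }

    fn used(&self) -> usize {
        self.used
    }
}
```

With this layout, putting and then removing an entry calls `memory_size` once in total, regardless of how wide the metadata is.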
   
   ## What changes are included in this PR?
   
   Two small, independent commits:
   
   - **Cache entry `memory_size` in `DefaultFilesMetadataCache`.** Store each
     entry's size alongside it (`SizedCacheEntry`), computed once at
     insertion, so put/evict/remove no longer re-walk the metadata.
   - **Skip the coercion lookup map when no coercion is needed.** Do a cheap
     flag-only first pass over the table fields and only build the
     name→type `HashMap` when a transformation is actually required.
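
The second change can be sketched roughly as follows. This is a simplified illustration, not the actual `apply_file_schema_type_coercions`: the real function operates on Arrow schemas, while `Ty`, `Field`, `field`, and `coerce_file_schema` here are hypothetical, and only a single string-to-view coercion is modeled.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily reduced set of data types.
#[derive(Clone, Debug, PartialEq)]
enum Ty {
    Utf8,
    Utf8View,
    Int64,
}

#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
    ty: Ty,
}

fn field(name: &str, ty: Ty) -> Field {
    Field { name: name.to_string(), ty }
}

/// Returns `None` when the file schema can be used unchanged.
fn coerce_file_schema(table: &[Field], file: &[Field]) -> Option<Vec<Field>> {
    // First pass: flags only, no allocation. Does any table field ask
    // for a view type at all?
    let needs_coercion = table.iter().any(|f| f.ty == Ty::Utf8View);
    if !needs_coercion {
        return None; // common path: the lookup map is never built
    }

    // Slow path: build the name -> type lookup only now.
    let wanted: HashMap<&str, &Ty> =
        table.iter().map(|f| (f.name.as_str(), &f.ty)).collect();

    let mut changed = false;
    let mut out = Vec::with_capacity(file.len());
    for f in file {
        let wants_view =
            wanted.get(f.name.as_str()).copied() == Some(&Ty::Utf8View);
        if wants_view && f.ty == Ty::Utf8 {
            changed = true;
            out.push(field(&f.name, Ty::Utf8View));
        } else {
            out.push(f.clone());
        }
    }
    changed.then_some(out)
}
```

The first pass is a plain scan over the table fields, so a wide table with no view types pays no hashing or allocation cost for the lookup.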
   
   ## Are these changes tested?
   
   Covered by existing tests — `datafusion-execution` cache tests and the
   `schema_coercion` tests in `datafusion-datasource-parquet` pass. The
   changes preserve existing behavior; they only remove redundant work.
   
   ## Are there any user-facing changes?
   
   No. `SizedCacheEntry` is an internal cache type; no public API changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


