xudong963 opened a new pull request, #22353:
URL: https://github.com/apache/datafusion/pull/22353

   ## Which issue does this PR close?
   
   - Closes #22189.
   
   ## Rationale for this change
   
   Parquet scans currently register a full set of per-file metrics when each
   `ParquetFileMetrics` is created, even when most of those metrics remain zero.
   This is costly for scans over many parquet files because metric registration
   clones labels such as `filename` and appends many unused metrics to the plan
   metrics set.
   
   ## What changes are included in this PR?
   
   - Create parquet file metric handles without immediately registering every
     metric.
   - Add a lazy registration guard that registers only non-zero parquet file
     metrics when the file metrics are dropped.
   - Keep the registration guard alive for push decoder streams so cloned metric
     handles can be updated throughout streaming before registration occurs.
   - Preserve the existing lazy behavior for
     `page_index_pages_skipped_by_fully_matched`.
   - Update parquet tests to allow zero-valued metrics to be absent while still
     requiring metrics that are expected to have non-zero values.
   - Add focused unit coverage for lazy registration of `Count`, `Gauge`, 
`Time`,
     `PruningMetrics`, and `RatioMetrics`.
   
   ## Are these changes tested?
   
   Yes.
   
   ```bash
   cargo fmt --all
   cargo clippy --all-targets --all-features -- -D warnings
   cargo test -p datafusion-datasource-parquet --lib
   cargo test -p datafusion --test parquet_integration
   ```
   
   I also ran a local release microbenchmark for repeated `ParquetFileMetrics`
   creation against `origin/main`; for 50k file metrics, zero-metric 
registration
   improved from about 57.9ms to 25.2ms, and the `bytes_scanned` case improved
   from about 57.7ms to 28.4ms.
   
   ## Are there any user-facing changes?
   
   There are no SQL result or schema changes.
   
   The observable metrics output changes: parquet file metrics that remain zero
   may now be absent from the metrics set instead of being present with value 
`0`.
   Low-level integrations should treat missing parquet file metrics as zero when
   appropriate.
   
   `ParquetFileMetrics` is a low-level public struct that is documented as 
subject
   to change. This PR adds an internal registration guard field, so external 
Rust
   code should construct it via `ParquetFileMetrics::new`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to