[PR] perf(spark): fast-path SELECT count(*) on COW tables via parquet footer row counts (#18769) [hudi]

via GitHub Mon, 18 May 2026 02:16:32 -0700


linliu-code opened a new pull request, #18770:
URL: https://github.com/apache/hudi/pull/18770


   ## Change Logs
   
   For `SELECT count(*)` against a COW table, 
`HoodieFileGroupReaderBasedFileFormat` already detects the case at line 252 
(`isCount = requiredSchema.isEmpty && !isMOR && !isIncremental`) but still 
routes through the vectorized reader, which opens each file and pre-fetches its 
row groups. As a result, `count(*)` reads ~2× the on-disk size per query 
iteration regardless of file size, whereas Spark's native `ParquetFileFormat` 
reads only about the footer's worth of bytes per file.
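
To make the footer-only cost concrete: a Parquet file ends with the serialized footer metadata, then a 4-byte little-endian footer length, then the `PAR1` magic, so a reader can locate the footer from the last 8 bytes alone. The following is a minimal stdlib-only Python sketch of that layout, not Hudi or Spark code. `footer_only_bytes` is a hypothetical helper name, and the sketch does not decode the Thrift metadata itself.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"
TRAILER_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


def footer_only_bytes(path: str) -> int:
    """Return how many bytes a footer-only read of `path` has to touch.

    Hypothetical helper for illustration: it reads the 8-byte trailer,
    validates the magic, and adds the footer length stored there.
    """
    file_size = os.path.getsize(path)
    if file_size < len(PARQUET_MAGIC) + TRAILER_LEN:
        raise ValueError("file too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(file_size - TRAILER_LEN)
        trailer = f.read(TRAILER_LEN)
    footer_len, magic = struct.unpack("<I4s", trailer)
    if magic != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic; not a plain Parquet file")
    if footer_len > file_size - TRAILER_LEN - len(PARQUET_MAGIC):
        raise ValueError("corrupt footer length")
    return footer_len + TRAILER_LEN
```

Comparing this value with the file size gives a rough lower bound on what a footer-only `count(*)` needs to read per file.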
   
   This PR adds a footer-only fast path:
   
   - When `isCount=true`, call a new `readCountFromFooter` instead of routing 
through `readBaseFile`.
   - `readCountFromFooter` uses `ParquetFileReader.readFooter(..., NO_FILTER)` 
to read just the parquet metadata, sums `BlockMetaData.getRowCount()` across 
row groups, and emits either a `ColumnarBatch` (when the downstream is 
vectorized, i.e. `supportReturningBatch=true`) or an `InternalRow` iterator 
otherwise.
   - Partition columns are populated as constants from `file.partitionValues` 
via `ConstantColumnVector` so downstream WSCG codegen that touches `column[i]` 
still sees valid data.
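
The control flow described in the bullets above can be modelled roughly as follows. This is a language-neutral sketch in Python, not the Scala code in the PR: `RowGroupMeta` and `CountBatch` are hypothetical stand-ins for parquet's `BlockMetaData` and Spark's `ColumnarBatch`, and the 4096-row chunking is an assumption, since the description does not say how batches are sized.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple, Union


@dataclass(frozen=True)
class RowGroupMeta:
    """Hypothetical stand-in for parquet's BlockMetaData."""
    row_count: int


@dataclass(frozen=True)
class CountBatch:
    """Hypothetical stand-in for a ColumnarBatch with no data columns."""
    num_rows: int
    partition_values: Tuple  # constant for every row in the batch


def is_count(required_schema: List[str], is_mor: bool, is_incremental: bool) -> bool:
    # Mirrors the gate quoted in the description.
    return len(required_schema) == 0 and not is_mor and not is_incremental


def read_count_from_footer(
    row_groups: List[RowGroupMeta],
    partition_values: Tuple,
    supports_returning_batch: bool,
    max_batch_rows: int = 4096,  # assumed chunk size, not taken from the PR
) -> Iterator[Union[CountBatch, Tuple]]:
    # The total comes from footer metadata only; no row group is read.
    total = sum(rg.row_count for rg in row_groups)
    if supports_returning_batch:
        remaining = total
        while remaining > 0:
            n = min(remaining, max_batch_rows)
            yield CountBatch(num_rows=n, partition_values=partition_values)
            remaining -= n
    else:
        # Row-based downstream: one row per record, carrying only
        # the constant partition values.
        for _ in range(total):
            yield partition_values
```

Either way the downstream aggregate only counts what it receives, so the emitted rows or batches need a correct row count and valid partition columns, nothing more.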
   
   The change lives in the shared `hudi-spark-common` module, so all 
Spark-version bundles (3.3, 3.4, 3.5, 4.0) inherit the fix.
   
   ## Impact
   
   Measured against `hudi-spark3.4-bundle_2.12` built from this branch (Spark 
3.4.3, Java 11):
   
   | Scale | Partitions × rows/part | Hudi count | Raw parquet count | Hudi wall | Raw wall | **Wall ratio, Hudi/raw (before patch)** |
   |---|---|---|---|---|---|---|
   | S | 1000 × 10 | 10,000 ✓ | 10,000 ✓ | 313 ms | 296 ms | **1.06× (2.76×)** |
   | L | 100 × 10,000 | 1,000,000 ✓ | 1,000,000 ✓ | 73 ms | 59 ms | **1.24× (2.18×)** |
   
   `bytesRead` per query iteration is also halved (882 MB → 441 MB at S; 88 MB → 
44 MB at L). The residual ~50% appears to come from Hudi's larger embedded 
footer (col-stats, bloom filter) plus driver-side MDT reads, both of which are 
out of scope for this PR.
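
The post-patch ratios can be re-derived from the figures quoted above. This is a small arithmetic check in Python; the numbers are copied from the table and the `bytesRead` sentence, and the function names are illustrative only.

```python
# (scale, hudi_wall_ms, raw_wall_ms, bytes_before_mb, bytes_after_mb)
MEASUREMENTS = [
    ("S", 313, 296, 882, 441),
    ("L", 73, 59, 88, 44),
]


def wall_ratio(hudi_ms: float, raw_ms: float) -> float:
    """Hudi wall time relative to raw parquet, rounded as in the table."""
    return round(hudi_ms / raw_ms, 2)


def bytes_fraction(before_mb: float, after_mb: float) -> float:
    """Fraction of the pre-patch bytesRead that remains after the patch."""
    return after_mb / before_mb


for scale, hudi_ms, raw_ms, before_mb, after_mb in MEASUREMENTS:
    print(scale, wall_ratio(hudi_ms, raw_ms), bytes_fraction(before_mb, after_mb))
```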
   
   Cross-version testing (issue #18769) showed the same overhead in 0.15.0 and 
0.15.1-rc1, so this is not a 1.x regression; the fast path has been missing 
for several releases.
   
   ## Correctness sanity (50-row table, patched bundle)
   
   - `SELECT count(*)` → 50 ✓
   - `SELECT count(*) WHERE rk<10` → 10 ✓ (non-count path with filter, 
unchanged)
   - `SELECT sum(val)` → 1225 ✓ (column-access aggregation)
   - `SELECT * FROM t LIMIT 5` → correct row values ✓
   
   The patch only adds an `if (isCount)` branch and otherwise falls through to 
existing code, so non-count queries are unaffected.
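
For reference, the expected values in the sanity list are consistent with a 50-row table whose `rk` and `val` columns both run from 0 to 49. That schema is an assumption; the description does not state it. The sketch below reproduces the four queries with Python's built-in `sqlite3` purely to show where 50, 10, and 1225 come from. It does not exercise Hudi or Spark.

```python
import sqlite3

# Assumed test data: rk = val = 0..49 (not stated in the PR description).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (rk INTEGER, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i) for i in range(50)])


def scalar(sql: str):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


count_all = scalar("SELECT count(*) FROM t")
count_filtered = scalar("SELECT count(*) FROM t WHERE rk < 10")
sum_val = scalar("SELECT sum(val) FROM t")
first_rows = conn.execute("SELECT * FROM t LIMIT 5").fetchall()

print(count_all, count_filtered, sum_val, len(first_rows))
```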
   
   ## Risk Level
   
   low
   
   Only the count-star path is changed. Non-count queries route through the 
existing code unchanged. The change is gated on `isCount = 
requiredSchema.isEmpty && !isMOR && !isIncremental` so MOR, incremental, and 
any query needing columns are untouched.
   
   ## Documentation Update
   
   None needed — this is a transparent perf optimization. User-facing behavior 
of `SELECT count(*)` is unchanged.
   
   ## Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable (see correctness sanity above; 
please advise if a dedicated regression test is desired and where it should 
live)
   - [x] CI passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

