linliu-code opened a new pull request, #18770: URL: https://github.com/apache/hudi/pull/18770
## Change Logs

For `SELECT count(*)` against a COW table, `HoodieFileGroupReaderBasedFileFormat` already detects the case at line 252 (`isCount = requiredSchema.isEmpty && !isMOR && !isIncremental`) but still routes through the vectorized reader, which opens each file and pre-fetches its row groups. As a result, `count(*)` reads ~2× the on-disk size per query iteration regardless of file size, where Spark's native `ParquetFileFormat` would only read ~footer bytes per file.

This PR adds a footer-only fast path:

- When `isCount=true`, call a new `readCountFromFooter` instead of routing through `readBaseFile`.
- `readCountFromFooter` uses `ParquetFileReader.readFooter(..., NO_FILTER)` to read just the parquet metadata, sums `BlockMetaData.getRowCount()` across row groups, and emits either a `ColumnarBatch` (when the downstream is vectorized, i.e. `supportReturningBatch=true`) or an `InternalRow` iterator otherwise.
- Partition columns are populated as constants from `file.partitionValues` via `ConstantColumnVector` so downstream WSCG codegen that touches `column[i]` still sees valid data.

The change lives in the shared `hudi-spark-common` module, so all Spark-version bundles (3.3, 3.4, 3.5, 4.0) inherit the fix.

## Impact

Measured against `hudi-spark3.4-bundle_2.12` built from this branch (Spark 3.4.3, Java 11):

| Scale | Partitions × rows/part | Hudi count | Raw parquet count | Hudi wall | Raw wall | **Wall ratio (was)** |
|---|---|---|---|---|---|---|
| S | 1000 × 10 | 10,000 ✓ | 10,000 ✓ | 313 ms | 296 ms | **1.06× (2.76×)** |
| L | 100 × 10,000 | 1,000,000 ✓ | 1,000,000 ✓ | 73 ms | 59 ms | **1.24× (2.18×)** |

bytesRead per query iteration is also halved (882 MB → 441 MB at S; 88 MB → 44 MB at L). The residual ~50% appears to come from Hudi's larger embedded footer (col-stats, bloom filter) plus driver-side MDT reads, both of which are out of scope for this PR.
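
For reviewers who want the counting step of the footer-only fast path in isolation, here is a minimal, dependency-free sketch in Java. It is not the patch's code. `RowGroupMeta` is a hypothetical stand-in for Parquet's `BlockMetaData`, and `isCount` / `countFromFooter` are illustrative names that only mirror the gate and the row-count summation described under Change Logs. The real path reads the footer via `ParquetFileReader` and wraps the result in a `ColumnarBatch` or `InternalRow` iterator, which is omitted here.

```java
import java.util.List;

final class CountFromFooterSketch {

    /** Hypothetical stand-in for Parquet's per-row-group metadata (BlockMetaData). */
    static final class RowGroupMeta {
        private final long rowCount;

        RowGroupMeta(long rowCount) {
            this.rowCount = rowCount;
        }

        long getRowCount() {
            return rowCount;
        }
    }

    /**
     * Mirrors the gate quoted in the PR description:
     * requiredSchema.isEmpty && !isMOR && !isIncremental.
     * The schema is modelled here as a plain field count (an assumption).
     */
    static boolean isCount(int requiredFieldCount, boolean isMOR, boolean isIncremental) {
        return requiredFieldCount == 0 && !isMOR && !isIncremental;
    }

    /** Sums the row counts of all row groups listed in one file's footer. */
    static long countFromFooter(List<RowGroupMeta> rowGroups) {
        long total = 0L;
        for (RowGroupMeta rowGroup : rowGroups) {
            total += rowGroup.getRowCount();
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up footer with three row groups; no data pages are touched.
        List<RowGroupMeta> footer = List.of(
                new RowGroupMeta(4000L),
                new RowGroupMeta(4000L),
                new RowGroupMeta(2000L));

        if (isCount(0, false, false)) {
            System.out.println(countFromFooter(footer)); // prints 10000
        }
    }
}
```

The point of the sketch is that the count depends only on footer metadata, so the cost per file is independent of how many data pages the file holds. A MOR or incremental read fails the gate and would fall through to the existing reader.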

Cross-version testing (issue #18769) showed the same overhead in 0.15.0 and 0.15.1-rc1, so this is not a 1.x regression; the missing fast path has existed for several releases.

## Correctness sanity (50-row table, patched bundle)

- `SELECT count(*)` → 50 ✓
- `SELECT count(*) WHERE rk<10` → 10 ✓ (non-count path with filter, unchanged)
- `SELECT sum(val)` → 1225 ✓ (column-access aggregation)
- `SELECT * FROM t LIMIT 5` → correct row values ✓

The patch only adds an `if (isCount)` branch and otherwise falls through to existing code, so non-count queries are unaffected.

## Risk Level

low

Only the count-star path is changed. Non-count queries route through the existing code unchanged. The change is gated on `isCount = requiredSchema.isEmpty && !isMOR && !isIncremental`, so MOR, incremental, and any query needing columns are untouched.

## Documentation Update

None needed; this is a transparent perf optimization. User-facing behavior of `SELECT count(*)` is unchanged.

## Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable (see correctness sanity above; please advise if a dedicated regression test is desired and where it should live)
- [x] CI passed

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
