Re: [PR] perf(spark): fast-path SELECT count(*) on COW tables via parquet footer row counts (#18769) [hudi]

via GitHub Mon, 18 May 2026 19:25:54 -0700


linliu-code commented on PR #18770:
URL: https://github.com/apache/hudi/pull/18770#issuecomment-4483891741


   Thanks for the careful review! I addressed all five comments in commit 
9701cb2. Per-comment responses:
   
   **Comment 1 (splittable files, line 543) — FIXED.** This was a real 
correctness bug. The footer's row groups are now filtered to only those whose 
`getStartingPos` falls inside the current split:
   
   ```scala
   val splitStart = file.start
   val splitEnd = file.start + file.length
   val rowCount = footer.getBlocks.asScala
     .filter(b => b.getStartingPos >= splitStart && b.getStartingPos < splitEnd)
     .foldLeft(0L)((acc, b) => acc + b.getRowCount)
   ```
   
   This matches the criterion `VectorizedParquetRecordReader` uses internally. 
Verified locally: with `spark.sql.files.maxPartitionBytes=64KB` and 
`hoodie.parquet.block.size=32KB`, my 100 base files split into 1000 Spark 
partitions, and `count(*)` still returns the expected 1,000,000 (would be ~10× 
without the filter).
   
   **Comment 2 (ORC/Lance, line 541) — FIXED.** Added a file-format gate 
alongside `isCount`:
   
   ```scala
   val canUseCountFastPath = isCount &&
     hoodieFileFormat == HoodieFileFormat.PARQUET &&
     !isMultipleBaseFileFormatsEnabled
   ```
   
   ORC and Lance COW tables, plus mixed-format tables, now fall through to 
`readBaseFile` (the existing code path).
   
   **Comment 3 (null match, line 545) — FIXED.** Added an explicit `case null 
=> InternalRow.empty` ahead of the type patterns. You're right that Spark 
normally passes a non-null `InternalRow`, but Scala type patterns silently fail 
on `null` and the resulting `MatchError` would be a confusing crash if anything 
ever did pass `null`. Cheap to guard against.
   
   **Comment 4 (magic number, line 551) — FIXED.** Extracted `4096` to a 
private `val COUNT_STAR_BATCH_SIZE` at the class level, with a comment noting 
it matches `spark.sql.parquet.columnarReaderBatchSize`.
   
   **Comment 5 (inline issue refs, line 303 and ~504) — FIXED.** Dropped both 
inline `Tracking: apache/hudi#18769` lines; kept the reference in commit 
messages and PR description as you suggested.
   
   Re-verified locally:
   - `mvn compile -Dspark3.3 / 3.4 / 3.5 -Dscala-2.12` all pass
   - Split-correctness probe: 1000 splits from 100 files → count returns 
expected 1M ✓
   - Standard perf probe: counts correct at scale S (10K) and L (1M); wall 
ratios still close to raw parquet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] perf(spark): fast-path SELECT count(*) on COW tables via parquet footer row counts (#18769) [hudi]

Reply via email to