linliu-code commented on PR #18770:
URL: https://github.com/apache/hudi/pull/18770#issuecomment-4483891741
Thanks for the careful review! I addressed all five comments in commit
9701cb2. Per-comment responses:
**Comment 1 (splittable files, line 543) — FIXED.** This was a real
correctness bug. The footer's row groups are now filtered to only those whose
`getStartingPos` falls inside the current split:
```scala
val splitStart = file.start
val splitEnd = file.start + file.length
val rowCount = footer.getBlocks.asScala
.filter(b => b.getStartingPos >= splitStart && b.getStartingPos < splitEnd)
.foldLeft(0L)((acc, b) => acc + b.getRowCount)
```
This matches the criterion `VectorizedParquetRecordReader` uses internally.
Verified locally: with `spark.sql.files.maxPartitionBytes=64KB` and
`hoodie.parquet.block.size=32KB`, my 100 base files split into 1000 Spark
partitions, and `count(*)` still returns the expected 1,000,000 (would be ~10×
without the filter).
**Comment 2 (ORC/Lance, line 541) — FIXED.** Added a file-format gate
alongside `isCount`:
```scala
val canUseCountFastPath = isCount &&
hoodieFileFormat == HoodieFileFormat.PARQUET &&
!isMultipleBaseFileFormatsEnabled
```
ORC and Lance COW tables, plus mixed-format tables, now fall through to
`readBaseFile` (the existing code path).
**Comment 3 (null match, line 545) — FIXED.** Added an explicit `case null
=> InternalRow.empty` ahead of the type patterns. You're right that Spark
normally passes a non-null `InternalRow`, but Scala type patterns silently fail
on `null` and the resulting `MatchError` would be a confusing crash if anything
ever did pass `null`. Cheap to guard against.
**Comment 4 (magic number, line 551) — FIXED.** Extracted `4096` to a
private `val COUNT_STAR_BATCH_SIZE` at the class level, with a comment noting
it matches `spark.sql.parquet.columnarReaderBatchSize`.
**Comment 5 (inline issue refs, line 303 and ~504) — FIXED.** Dropped both
inline `Tracking: apache/hudi#18769` lines; kept the reference in commit
messages and PR description as you suggested.
Re-verified locally:
- `mvn compile -Dspark3.3 / 3.4 / 3.5 -Dscala-2.12` all pass
- Split-correctness probe: 1000 splits from 100 files → count returns
expected 1M ✓
- Standard perf probe: counts correct at scale S (10K) and L (1M); wall
ratios still close to raw parquet.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]