Re: [PR] perf(spark): fast-path SELECT count(*) on COW tables via parquet footer row counts (#18769) [hudi]

via GitHub Mon, 18 May 2026 19:14:03 -0700


linliu-code commented on PR #18770:
URL: https://github.com/apache/hudi/pull/18770#issuecomment-4483839958


   Fixed the CI compile error. Root cause: `ColumnVectorUtils.populate` has 
incompatible signatures across the Spark versions hudi-spark-common compiles 
against:
   
   | Spark version | populate signature |
   |---|---|
   | 3.3.x | `populate(WritableColumnVector, InternalRow, int)` |
   | 3.4.x | `populate(ConstantColumnVector, InternalRow, int)` |
   | 3.5.x | `populate(ConstantColumnVector, InternalRow, int)` |
   
   No single overload works for all three. Replaced the call with a small 
private helper that switches on the partition column's `DataType` and uses 
`ConstantColumnVector`'s primitive setters directly (those have been stable 
across 3.3 through 3.5). Unsupported partition types fall through to 
`setNull()`. That is safe for count(*) because partition predicates are applied 
at planning time by the FileIndex, not by reading these vectors at execution.
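   
   For readers who want the shape of that dispatch, here is a minimal, 
self-contained sketch. It is not the code from the PR: `ConstVector`, 
`PartitionType`, and `populateConstant` are hypothetical stand-ins so the 
example runs without Spark on the classpath, and the choice of which types 
count as "unsupported" is an assumption made for illustration.

```java
// Illustrative sketch only: ConstVector and PartitionType are stand-ins,
// not Spark's ConstantColumnVector or DataType.
final class PartitionVectorSketch {

    // Hypothetical subset of partition column types.
    enum PartitionType { BOOLEAN, INT, LONG, DOUBLE, STRING, BINARY }

    // Stand-in for a constant vector: one value shared by every row in a batch.
    static final class ConstVector {
        private Object value;
        private boolean isNull = true;

        void setBoolean(boolean v) { set(v); }
        void setInt(int v) { set(v); }
        void setLong(long v) { set(v); }
        void setDouble(double v) { set(v); }
        void setString(String v) { set(v); }
        void setNull() { value = null; isNull = true; }

        boolean isNull() { return isNull; }
        Object value() { return value; }

        private void set(Object v) { value = v; isNull = false; }
    }

    // Dispatch on the partition type and call the matching typed setter.
    // Null values and types without a case fall through to setNull().
    static void populateConstant(ConstVector vector, PartitionType type, Object partitionValue) {
        if (partitionValue == null) {
            vector.setNull();
            return;
        }
        switch (type) {
            case BOOLEAN: vector.setBoolean((Boolean) partitionValue); break;
            case INT:     vector.setInt((Integer) partitionValue); break;
            case LONG:    vector.setLong((Long) partitionValue); break;
            case DOUBLE:  vector.setDouble((Double) partitionValue); break;
            case STRING:  vector.setString((String) partitionValue); break;
            default:      vector.setNull(); // BINARY is left unsupported in this sketch
        }
    }

    public static void main(String[] args) {
        ConstVector year = new ConstVector();
        populateConstant(year, PartitionType.INT, 2026);
        System.out.println(year.value());   // prints 2026

        ConstVector blob = new ConstVector();
        populateConstant(blob, PartitionType.BINARY, new byte[] {1});
        System.out.println(blob.isNull());  // prints true
    }
}
```

   The point of the pattern is that the helper only touches typed setters and 
`setNull()`, so it does not depend on whichever `populate` overload a given 
Spark version ships.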
   
   Verified locally:
   - `mvn compile -Dspark3.3 -Dscala-2.12` passes
   - `mvn compile -Dspark3.4 -Dscala-2.12` passes
   - `mvn compile -Dspark3.5 -Dscala-2.12` passes
   - runtime: `count=10,000` (scale S) and `count=1,000,000` (scale L) are 
correct; wall-clock time ratio vs raw parquet is 1.26× / 1.20×.
   
   Pushed as a fixup commit on the same branch.


