linliu-code commented on issue #18769:
URL: https://github.com/apache/hudi/issues/18769#issuecomment-4475772337

   ## Update: cross-version replay — same issue in 0.15.0 and 0.15.1-rc1
   
   I ran the same probe (100 partitions × 10K rows, COW table, `count(*)` with the metadata table (MDT), col-stats, and data skipping enabled) against three Hudi bundles, all on Spark 3.4.3 / Scala 2.12 / Java 11:
   
   | Version | Files | On-disk | Wall (median) | bytesRead (median) | Amp vs disk |
   |---|---|---|---|---|---|
   | 0.15.0 | 100 | 51.0 MB | 177 ms | 84.3 MB | 1.65× |
   | 0.15.1-rc1 | 100 | 51.1 MB | 168 ms | 84.3 MB | 1.65× |
   | 1.1.1 | 100 | 51.0 MB | 209 ms | 84.3 MB | 1.65× |
   
   Raw parquet baseline at this scale (from the measurements in the issue body): bytesRead ≈ 376 KB, so the bytesRead-vs-raw ratio is ~224× for all three Hudi versions.
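
The ratios quoted above can be re-derived from the table with a few lines of arithmetic. This is only a sanity-check sketch: the figures are copied from this comment, the function names are made up for illustration, and decimal units (1 MB = 1000 KB) are an assumption, since the comment does not say which convention was used.

```python
# Figures copied from the table and the raw-parquet baseline in this comment.
# Assumption: decimal units (1 MB = 1000 KB); the comment does not state this.
ON_DISK_MB = {"0.15.0": 51.0, "0.15.1-rc1": 51.1, "1.1.1": 51.0}
WALL_MS = {"0.15.0": 177, "0.15.1-rc1": 168, "1.1.1": 209}
BYTES_READ_MB = 84.3
RAW_PARQUET_KB = 376.0


def amplification_vs_disk(bytes_read_mb, on_disk_mb):
    """bytesRead divided by the on-disk size of the table."""
    return bytes_read_mb / on_disk_mb


def ratio_vs_raw(bytes_read_mb, raw_kb):
    """bytesRead divided by the raw-parquet bytesRead baseline."""
    return bytes_read_mb * 1000.0 / raw_kb


def wall_overhead_pct(new_ms, old_ms):
    """Extra wall time of new_ms relative to old_ms, in percent."""
    return (new_ms / old_ms - 1.0) * 100.0


if __name__ == "__main__":
    for version, on_disk in ON_DISK_MB.items():
        amp = amplification_vs_disk(BYTES_READ_MB, on_disk)
        print(f"{version}: {amp:.2f}x vs disk")
    print(f"vs raw parquet: {ratio_vs_raw(BYTES_READ_MB, RAW_PARQUET_KB):.0f}x")
    for old in ("0.15.0", "0.15.1-rc1"):
        pct = wall_overhead_pct(WALL_MS["1.1.1"], WALL_MS[old])
        print(f"1.1.1 vs {old}: +{pct:.0f}% wall")
```

Under the decimal-unit assumption this reproduces 1.65× and ~224×. With binary units the raw ratio would come out slightly higher, which does not change the conclusion.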
   
   **Two takeaways:**
   
   1. **Not a 1.x regression.** The missing `count(*)` fast path goes back to at least 0.15.0. The implementation moved from `HoodieParquetFileFormat` (0.15.x) to `HoodieFileGroupReaderBasedFileFormat` (1.x), but neither implementation short-circuits when `requiredSchema.isEmpty`. If a backport is desired, the 0.15.x reader needs an analogous fix.
   
   2. **1.1.1 has ~20% higher wall time at the same bytesRead** than 0.15.x (209 ms vs. 177 ms and 168 ms). The likely cause is CPU overhead in the new file-group-reader wrapper path, not a difference in bytes read. That is probably worth a separate look, but it is secondary to this issue.
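
To make the first takeaway concrete, here is a toy model of what a short-circuit on an empty required schema buys. It is not Hudi or Spark code: `ToyFile`, `ToyReader`, the byte sizes, and the fallback of reading every column when no fast path exists are all hypothetical, chosen only to show why answering `count(*)` from footer metadata keeps bytesRead small.

```python
from dataclasses import dataclass


@dataclass
class ToyFile:
    """Hypothetical stand-in for one base file."""
    num_rows: int            # row count as recorded in the footer
    column_bytes: dict       # column name -> bytes of column data on disk
    footer_bytes: int = 4096


@dataclass
class ToyReader:
    """Hypothetical reader that tracks how many bytes it touches."""
    short_circuit_empty_schema: bool
    bytes_read: int = 0

    def count_rows(self, f, required_schema):
        # The footer is read in both paths.
        self.bytes_read += f.footer_bytes
        if not required_schema and self.short_circuit_empty_schema:
            # Fast path: no columns requested, so the footer row count suffices.
            return f.num_rows
        # Assumed slow path: with no fast path, column data is materialised
        # even though count(*) needs none of it.
        columns = required_schema or list(f.column_bytes)
        self.bytes_read += sum(f.column_bytes[c] for c in columns)
        return f.num_rows


def run(short_circuit):
    """Count rows over 100 toy files and report (rows, bytes read)."""
    files = [ToyFile(10_000, {"id": 200_000, "payload": 300_000})
             for _ in range(100)]
    reader = ToyReader(short_circuit_empty_schema=short_circuit)
    total = sum(reader.count_rows(f, []) for f in files)
    return total, reader.bytes_read


if __name__ == "__main__":
    for flag in (False, True):
        rows, read = run(flag)
        print(f"short_circuit={flag}: rows={rows}, bytes_read={read}")
```

Both runs return the same row count. Only the bytes touched differ, which is the shape of the gap between the Hudi readers and the raw parquet baseline reported above.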


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
