yaooqinn opened a new pull request, #12138:
URL: https://github.com/apache/gluten/pull/12138

   [GLUTEN-3456][VL] Enable columnar table cache by default and extend benchmark coverage
   
   ### What changes were proposed in this pull request?
   
   1. Flip `spark.gluten.sql.columnar.tableCache` default from `false` to `true`.
   2. Extend `ColumnarTableCacheBenchmark` to cover:
      - 3 sources: `parquet` (Velox-native columnar), `csv`, `json` (row-based fallback per GLUTEN-3456).
      - 2 schema shapes: 5-col numeric mix, and a 16-col x ~200-char wide-string shape (the GLUTEN-3488 hazard).
      - Cases: `count` / column-pruning / filter for numeric; `count` for wide-string.
   3. Regenerate `ColumnarTableCacheBenchmark-results.txt` with the new matrix.
   
   ### Why are the changes needed?
   
   GLUTEN-3456 raised the concern that for row-based sources (csv/json) the
   Velox columnar cache would lose to vanilla Spark because of the R2C/C2R
   conversion tax. The extended benchmark shows the opposite on Velox today
   in all but one case:
   
   | Case                                  | disable (ms) | enable (ms) | speedup |
   |---------------------------------------|-------------:|------------:|--------:|
   | numeric/parquet count                 |        17377 |        2975 |   5.84x |
   | numeric/parquet column pruning        |        20768 |        3778 |   5.50x |
   | numeric/parquet filter                |        22681 |        4242 |   5.35x |
   | numeric/csv count                     |        40502 |       30146 |   1.34x |
   | numeric/csv column pruning            |        42245 |       30667 |   1.38x |
   | numeric/csv filter                    |        43929 |       31077 |   1.41x |
   | numeric/json count                    |        44659 |       28467 |   1.57x |
   | numeric/json column pruning           |        46961 |       29230 |   1.61x |
   | numeric/json filter                   |        49106 |       29061 |   1.69x |
   | wide-string/parquet count             |        40888 |       11863 |   3.45x |
   | wide-string/csv count                 |        82433 |       86708 |   0.95x |
   | wide-string/json count                |        70729 |       54856 |   1.29x |
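
The speedup column is consistent with the ratio of the disable time to the enable time, so a value above 1 means the cache helps. As a quick sanity check, the short script below (an illustrative sketch, not part of the PR; the names `RESULTS`, `speedup`, and `summarize` are made up for this example) recomputes the ratios from the timings in the table and lists any case that falls below 1.0x.

```python
# (case, disable_ms, enable_ms) copied from the results table above.
RESULTS = [
    ("numeric/parquet count", 17377, 2975),
    ("numeric/parquet column pruning", 20768, 3778),
    ("numeric/parquet filter", 22681, 4242),
    ("numeric/csv count", 40502, 30146),
    ("numeric/csv column pruning", 42245, 30667),
    ("numeric/csv filter", 43929, 31077),
    ("numeric/json count", 44659, 28467),
    ("numeric/json column pruning", 46961, 29230),
    ("numeric/json filter", 49106, 29061),
    ("wide-string/parquet count", 40888, 11863),
    ("wide-string/csv count", 82433, 86708),
    ("wide-string/json count", 70729, 54856),
]


def speedup(disable_ms, enable_ms):
    """Ratio of cache-disabled time to cache-enabled time; > 1 means the cache helps."""
    return disable_ms / enable_ms


def summarize(results):
    """Return (case, speedup rounded to 2 places) rows and the cases that regress."""
    rows = [(case, round(speedup(d, e), 2)) for case, d, e in results]
    regressions = [case for case, s in rows if s < 1.0]
    return rows, regressions


rows, regressions = summarize(RESULTS)
for case, s in rows:
    print(f"{case:<32} {s:.2f}x")
print(f"{len(rows) - len(regressions)} / {len(rows)} cases improve")
print(f"regressions: {regressions}")
```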
   
   11 / 12 cases improve. The only regression is `wide-string/csv count`
   (-5%), where the Arrow CSV scan + R2C cost on a 16-col x ~200-char shape
   slightly outweighs the cache benefit. Given how narrow that corner is,
   flipping the default to `true` is the right trade-off; users hitting
   that shape can still set `spark.gluten.sql.columnar.tableCache=false`.
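
For reference, opting out could look like the following `spark-defaults.conf` fragment. The key is the one named in this PR; the file placement is only one option and is shown as an illustration, not as something this patch adds.

```properties
# spark-defaults.conf: opt out of the columnar table cache for affected workloads
spark.gluten.sql.columnar.tableCache  false
```

The same setting can be supplied per job at submit time with `--conf spark.gluten.sql.columnar.tableCache=false`.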
   
   Hardware: `Intel(R) Xeon(R) Platinum 8473C`, Linux WSL2, JDK 17. Numbers
   are 3-iter Best Time from the regenerated golden file.
   
   ### How was this patch tested?
   
   - Re-ran `ColumnarTableCacheBenchmark` in three modes (vanilla / Gluten
     off / Gluten on) and regenerated the golden file.
   - `./build/mvn -Pbackends-velox -Pspark-3.5 spotless:apply` clean.
   
   Generated-by: Claude Opus 4.7
   

