yaooqinn opened a new pull request, #12138:
URL: https://github.com/apache/gluten/pull/12138
[GLUTEN-3456][VL] Enable columnar table cache by default and extend benchmark coverage
### What changes were proposed in this pull request?
1. Flip `spark.gluten.sql.columnar.tableCache` default from `false` to `true`.
2. Extend `ColumnarTableCacheBenchmark` to cover:
   - 3 sources: `parquet` (Velox-native columnar), `csv`, `json` (row-based fallback per GLUTEN-3456).
   - 2 schema shapes: 5-col numeric mix, and a 16-col x ~200-char wide-string shape (the GLUTEN-3488 hazard).
   - Cases: `count` / column-pruning / filter for numeric; `count` for wide-string.
3. Regenerate `ColumnarTableCacheBenchmark-results.txt` with the new matrix.
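To make the wide-string shape concrete, here is a minimal Python sketch that produces rows of 16 string columns with 200 characters each and serializes them as CSV. It is an illustration only: the function names, column names (`c0` to `c15`), and the random alphanumeric content are assumptions, and the generator actually used by `ColumnarTableCacheBenchmark` is not shown in this description.

```python
import csv
import io
import random
import string

NUM_COLS = 16   # column count stated in the PR description
STR_LEN = 200   # "~200-char"; fixed at exactly 200 here for simplicity


def wide_string_rows(num_rows, seed=0):
    """Yield rows of NUM_COLS random alphanumeric strings (hypothetical data)."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    alphabet = string.ascii_letters + string.digits
    for _ in range(num_rows):
        yield ["".join(rng.choices(alphabet, k=STR_LEN)) for _ in range(NUM_COLS)]


def to_csv(rows):
    """Serialize rows to CSV text with a header of assumed column names."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([f"c{i}" for i in range(NUM_COLS)])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    text = to_csv(wide_string_rows(3))
    print(len(text.splitlines()), "lines, including the header")
```

Each data row carries roughly 3.2 KB of string payload, which is the kind of row a row-based source has to convert before it can be cached in columnar form.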
### Why are the changes needed?
GLUTEN-3456 raised the concern that for row-based sources (csv/json) the
Velox columnar cache would lose to vanilla Spark because of the
row-to-columnar (R2C) and columnar-to-row (C2R) conversion tax. The
extended benchmark shows the opposite on Velox today in 11 of 12 cases:
| Case                           | disable (ms) | enable (ms) | speedup |
|--------------------------------|-------------:|------------:|--------:|
| numeric/parquet count          |        17377 |        2975 |   5.84x |
| numeric/parquet column pruning |        20768 |        3778 |   5.50x |
| numeric/parquet filter         |        22681 |        4242 |   5.35x |
| numeric/csv count              |        40502 |       30146 |   1.34x |
| numeric/csv column pruning     |        42245 |       30667 |   1.38x |
| numeric/csv filter             |        43929 |       31077 |   1.41x |
| numeric/json count             |        44659 |       28467 |   1.57x |
| numeric/json column pruning    |        46961 |       29230 |   1.61x |
| numeric/json filter            |        49106 |       29061 |   1.69x |
| wide-string/parquet count      |        40888 |       11863 |   3.45x |
| wide-string/csv count          |        82433 |       86708 |   0.95x |
| wide-string/json count         |        70729 |       54856 |   1.29x |
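The speedup column in the table above is the disabled time divided by the enabled time. As a sanity check, the following Python sketch recomputes it from the reported times and lists any case that comes out below 1.0x. The helper names are illustrative and not part of the benchmark code.

```python
# Best Time in ms, copied from the table: (case, disable, enable).
RESULTS = [
    ("numeric/parquet count", 17377, 2975),
    ("numeric/parquet column pruning", 20768, 3778),
    ("numeric/parquet filter", 22681, 4242),
    ("numeric/csv count", 40502, 30146),
    ("numeric/csv column pruning", 42245, 30667),
    ("numeric/csv filter", 43929, 31077),
    ("numeric/json count", 44659, 28467),
    ("numeric/json column pruning", 46961, 29230),
    ("numeric/json filter", 49106, 29061),
    ("wide-string/parquet count", 40888, 11863),
    ("wide-string/csv count", 82433, 86708),
    ("wide-string/json count", 70729, 54856),
]


def speedup(disable_ms, enable_ms):
    """Speedup of the cache-enabled run relative to the cache-disabled run."""
    return disable_ms / enable_ms


def summarize(results):
    """Return (case, rounded speedup) pairs and the cases that regress."""
    rows = [(case, round(speedup(d, e), 2)) for case, d, e in results]
    regressions = [case for case, s in rows if s < 1.0]
    return rows, regressions


if __name__ == "__main__":
    rows, regressions = summarize(RESULTS)
    for case, s in rows:
        print(f"{case:<32} {s:.2f}x")
    print(f"{len(rows) - len(regressions)} / {len(rows)} cases improve")
```

Run as a script, this reproduces the twelve speedup values from the table and reports `wide-string/csv count` as the single case below 1.0x.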
11 of 12 cases improve. The only regression is `wide-string/csv count`
(0.95x, about 5% slower with the cache enabled), where the Arrow CSV scan
plus R2C cost on a 16-col x ~200-char shape slightly outweighs the cache
benefit. Given how narrow that corner is, flipping the default to `true`
is the right trade-off; users hitting that shape can still set
`spark.gluten.sql.columnar.tableCache=false`.
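For users who want the opt-out, one way to supply it is at application start through `spark-submit`'s `--conf key=value` option. The Python sketch below only renders those arguments from a mapping. The helper name and the `OPT_OUT` constant are hypothetical, and it assumes the setting is passed at submit time rather than changed inside a running session.

```python
# Hypothetical helper: renders Spark settings as spark-submit arguments.
OPT_OUT = {"spark.gluten.sql.columnar.tableCache": "false"}


def to_spark_submit_flags(conf):
    """Render a conf mapping as a flat list of `--conf key=value` arguments."""
    flags = []
    for key, value in sorted(conf.items()):  # sorted for stable output
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    # Prints the arguments to append to an existing spark-submit command line.
    print(" ".join(to_spark_submit_flags(OPT_OUT)))
```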
Hardware: `Intel(R) Xeon(R) Platinum 8473C`, Linux WSL2, JDK 17. Numbers
are the Best Time over 3 iterations, taken from the regenerated golden file.
### How was this patch tested?
- Re-ran `ColumnarTableCacheBenchmark` in three modes (vanilla / Gluten
off / Gluten on) and regenerated the golden file.
- `./build/mvn -Pbackends-velox -Pspark-3.5 spotless:apply` clean.
Generated-by: Claude Opus 4.7
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]