GitHub user yjhjstz created a discussion: Using locale C / COLLATE "C" to unlock ORCA performance for TPC-DS benchmarks
## Motivation

When running TPC-DS benchmarks on Cloudberry, we noticed that queries on string columns (item descriptions, store names, customer names, etc.) often cause **ORCA to fall back to the Postgres planner**. The root cause is that ORCA currently does not support columns with `COLLATE "C"`, as tracked in [issue #717](https://github.com/apache/cloudberry/issues/717).

Beyond fixing the fallback, there is a broader performance opportunity: locale C is significantly faster than locale-aware collations for string comparisons.

## Performance Evidence

A detailed benchmark published by depesz, [How much speed you are leaving at the table if you use default locale](https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/), shows:

| Operation | Speedup with locale C vs default locale |
|-----------|------------------------------------------|
| Equality checks on unindexed data | ~50% faster |
| Range queries | up to **107% faster** |
| Sequential scan comparisons | >20% faster |

Key takeaways:

- `libc/C` collation is the fastest across nearly every benchmark
- Even the newer ICU and builtin providers lag significantly behind C
- The overhead of locale-aware collation accumulates heavily in analytical workloads like TPC-DS

## Current Behavior in Cloudberry / ORCA

```sql
-- Table with default collation: ORCA handles it
EXPLAIN SELECT * FROM tbl ORDER BY v;
-- Optimizer: Pivotal Optimizer (GPORCA)

-- Table with COLLATE "C": ORCA falls back
EXPLAIN SELECT * FROM tbl_collate_c ORDER BY v;
-- Optimizer: Postgres query optimizer (fallback!)
```

This means that for TPC-DS string columns defined with `COLLATE "C"`, we lose ORCA's superior join ordering, parallel aggregation plans, and better sort/merge strategies.

## Proposal

We have opened [issue #1603](https://github.com/apache/cloudberry/issues/1603) to track the work. The proposed steps are:

1. **Fix ORCA to support COLLATE "C"** (prerequisite: #717): teach ORCA to recognize and use C collation in sort keys, merge keys, and equality operators
2. **Ensure ORCA's string comparison operators are collation-aware**: avoid incorrect plans when mixing collations
3. **Recommend C locale for TPC-DS test tables**: document or script the best-practice setup so benchmarks reflect ORCA's full optimization capability

## Discussion Questions

- Should Cloudberry's default cluster initialization (`initdb`) recommend or default to `LC_COLLATE=C` for analytical workloads?
- Are there correctness concerns with C collation in a distributed MPP setting (e.g., segment-level sort merge across different OS locales)?
- What is the right approach for ORCA to handle multiple collations: treat them as opaque and fall back, or model them explicitly in the optimizer?
- Has anyone already tested TPC-DS with `LC_COLLATE=C` on Cloudberry? What were your findings?

Would love to hear thoughts from the community, especially from contributors familiar with ORCA internals and anyone who has run TPC-DS benchmarks on Cloudberry.

## References

- [depesz benchmark: locale C performance](https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/)
- [Issue #717: ORCA fallbacks for collate "C"](https://github.com/apache/cloudberry/issues/717)
- [Issue #1603: Feature request to support locale C in ORCA for TPC-DS](https://github.com/apache/cloudberry/issues/1603)

GitHub link: https://github.com/apache/cloudberry/discussions/1604
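
## Reproduction Sketch

To reproduce the comparison described under "Current Behavior", a setup along the following lines may be enough. This is only a minimal sketch: the table names `tbl` and `tbl_collate_c` and the column `v` come from the post, while the `text` column type and the sample rows are assumptions made for illustration. No plan output is shown because it depends on the Cloudberry build.

```sql
-- Hypothetical setup: the post names the tables and the column `v`,
-- but the column type and the sample rows below are assumptions.
SET optimizer = on;  -- request GPORCA for this session

CREATE TABLE tbl (v text);                        -- default collation
CREATE TABLE tbl_collate_c (v text COLLATE "C");  -- explicit C collation

INSERT INTO tbl VALUES ('apple'), ('Banana'), ('cherry');
INSERT INTO tbl_collate_c SELECT v FROM tbl;

ANALYZE tbl;
ANALYZE tbl_collate_c;

-- Compare the "Optimizer:" line reported at the end of each plan.
EXPLAIN SELECT * FROM tbl ORDER BY v;
EXPLAIN SELECT * FROM tbl_collate_c ORDER BY v;
```

Note that the two tables can also return rows in a different order: under `COLLATE "C"` strings are compared bytewise, so `'Banana'` sorts before `'apple'`, whereas most locale-aware collations place `'apple'` first.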
