yjhjstz opened a new issue, #1603: URL: https://github.com/apache/cloudberry/issues/1603
## Background

Issue #717 tracks a bug where ORCA falls back to the Postgres optimizer when column attributes use `COLLATE "C"`. This feature request proposes **actively using locale C / collation C** across the database (or enabling it in ORCA plans) to gain significant performance improvements, especially for TPC-DS workloads.

## Performance Impact

A detailed benchmark by depesz ([How much speed you're leaving at the table if you use default locale](https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/)) demonstrates that the `libc/C` collation consistently outperforms locale-specific collations:

| Operation | Performance gain vs. default locale |
|-----------|-------------------------------------|
| Equality checks on unindexed data | ~50% faster |
| Range queries | up to 107% faster |
| Sequential scan comparisons | >20% faster |

Key findings:

- `libc/C` (collation C) proved fastest in nearly every benchmark
- Newer builtin providers (e.g., ICU) did **not** match libc/C performance
- Even sequential scans with builtin providers were >20% slower than C collation

## Why This Matters for TPC-DS

TPC-DS contains many string columns (item descriptions, store names, date strings, etc.) that are compared with `=`, `<`, `>`, `ORDER BY`, and `GROUP BY`. When ORCA falls back to the Postgres planner for these operations due to collation issues:

1. ORCA's superior plan shapes (e.g., parallel aggregates, better join ordering) are lost
2. String comparisons use locale-aware collation, incurring significant overhead
3. Benchmark results cannot fairly reflect ORCA's optimization capabilities

## Current Behavior

As shown in #717, ORCA currently falls back when it encounters `COLLATE "C"` columns:

```sql
-- ORCA works fine for default collation
EXPLAIN SELECT * FROM tbl ORDER BY v;
-- Optimizer: Pivotal Optimizer (GPORCA) ✓

-- ORCA falls back for collate C columns
EXPLAIN SELECT * FROM tbl_collate_c ORDER BY v;
-- Optimizer: Postgres query optimizer ✗ (fallback)
```

## Proposed Solution

1. **Fix ORCA to support `COLLATE "C"`** (prerequisite: #717): allow ORCA to generate and execute plans for tables/columns with C collation
2. **Enable C locale support in ORCA's sort/comparison operators**: ensure ORCA correctly classifies C collation as a supported collation for sort keys, merge keys, and equality comparisons
3. **TPC-DS test setup**: consider recommending `LC_COLLATE=C` or `COLLATE "C"` for TPC-DS benchmark tables to unlock ORCA's full optimization potential

## Expected Benefit

- Full ORCA plan coverage for TPC-DS string columns without fallback
- 50–100%+ speedup on string-heavy sort/scan/join operations
- More accurate and competitive TPC-DS benchmark results for Cloudberry

## Related

- Fixes #717 (ORCA fallbacks for collate "C")
- Reference: https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/
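
For anyone wanting to reproduce the fallback and try the proposed test setup, the following is a minimal sketch rather than a verified script. The table names `tbl` and `tbl_collate_c` and the column `v` come from the example above; the `id` column, the `DISTRIBUTED BY` clauses, and the database name `tpcds_c` are assumptions made for illustration, and the optimizer reported by `EXPLAIN` may differ by Cloudberry version.

```sql
-- Hypothetical setup: the column definitions and distribution keys are assumed,
-- only the table names and the column v appear in the issue text.
CREATE TABLE tbl (id int, v text) DISTRIBUTED BY (id);
CREATE TABLE tbl_collate_c (id int, v text COLLATE "C") DISTRIBUTED BY (id);

-- Confirm which collation each column actually carries.
SELECT a.attrelid::regclass AS table_name, a.attname, c.collname
FROM pg_attribute a
JOIN pg_collation c ON c.oid = a.attcollation
WHERE a.attrelid IN ('tbl'::regclass, 'tbl_collate_c'::regclass)
  AND a.attnum > 0;

-- Make sure ORCA is enabled, then compare the "Optimizer:" line of each plan.
SET optimizer = on;
EXPLAIN SELECT * FROM tbl ORDER BY v;
EXPLAIN SELECT * FROM tbl_collate_c ORDER BY v;

-- Database-wide alternative from the proposed TPC-DS setup (run while connected
-- to another database): a database whose default collation is C.
-- template0 is required when the locale differs from the template's.
CREATE DATABASE tpcds_c TEMPLATE template0 LC_COLLATE 'C' LC_CTYPE 'C';
```

The per-column form triggers the fallback tracked in #717, while the database-wide form makes C the default collation, so the two are worth testing separately when evaluating a fix.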
