yjhjstz opened a new issue, #1603:
URL: https://github.com/apache/cloudberry/issues/1603

   ## Background
   
   Issue #717 tracks a bug where ORCA falls back to the Postgres optimizer when 
column attributes use `COLLATE "C"`. This feature request proposes **actively 
using locale C / collation C** across the database (or enabling it in ORCA 
plans) to gain significant performance improvements, especially for TPC-DS 
workloads.
   
   ## Performance Impact
   
   A detailed benchmark by depesz ([How much speed you're leaving at the table if you use default locale](https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/)) demonstrates that the `libc/C` collation consistently outperforms locale-specific collations:
   
   | Operation | Performance Gain vs default locale |
   |-----------|-----------------------------------|
   | Equality checks on unindexed data | ~50% faster |
   | Range queries | up to 107% faster |
   | Sequential scan comparisons | >20% faster |
   
   Key findings:
   - `libc/C` (collation C) proved fastest in nearly every benchmark
   - Other collation providers (e.g., ICU and the PostgreSQL 17 `builtin` provider) did **not** match libc/C performance
   - Even sequential scans with these providers were >20% slower than with C collation
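
   A rough way to observe the same effect locally is to time one predicate under the default collation and again under `"C"`. The sketch below is an illustrative micro-benchmark, not the method used in the depesz article: the table name `bench_strings`, the row count, and the literal `'8'` are all made up for this example.

   ```sql
   -- Hypothetical micro-benchmark; table name, row count, and literal are illustrative.
   CREATE TABLE bench_strings AS
   SELECT md5(g::text) AS s
   FROM generate_series(1, 1000000) AS g;

   ANALYZE bench_strings;

   -- Range predicate evaluated with the column's default collation.
   EXPLAIN ANALYZE
   SELECT count(*) FROM bench_strings WHERE s < '8';

   -- Same predicate forced to byte-wise comparison via an explicit C collation.
   EXPLAIN ANALYZE
   SELECT count(*) FROM bench_strings WHERE s COLLATE "C" < '8';
   ```

   If the database was already created with `LC_COLLATE=C`, both queries use the same comparison and no difference is expected. On Cloudberry, the second query may itself trigger the ORCA fallback described in #717, so check which optimizer appears in the `EXPLAIN` output before comparing timings.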
   
   ## Why This Matters for TPC-DS
   
   TPC-DS contains many string columns (item descriptions, store names, date strings, etc.) that are compared with `=`, `<`, and `>`, or used in `ORDER BY` and `GROUP BY`. When ORCA falls back to the Postgres planner for these operations due to collation issues:
   
   1. ORCA's superior plan shapes (e.g., parallel aggregates, better join 
ordering) are lost
   2. String comparisons use locale-aware collation, incurring significant 
overhead
   3. Benchmark results cannot fairly reflect ORCA's optimization capabilities
   
   ## Current Behavior
   
   As shown in #717, ORCA currently falls back when it encounters `COLLATE "C"` 
columns:
   
   ```sql
   -- ORCA works fine for default collation
   EXPLAIN SELECT * FROM tbl ORDER BY v;
   -- Optimizer: Pivotal Optimizer (GPORCA) ✓
   
   -- ORCA falls back for collate C columns
   EXPLAIN SELECT * FROM tbl_collate_c ORDER BY v;
   -- Optimizer: Postgres query optimizer ✗ (fallback)
   ```
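
   The table definitions behind this example are not shown in the issue. A minimal reproduction might look like the following, assuming `tbl` and `tbl_collate_c` each have an integer distribution key `id` and a text column `v` (both layouts are guesses). `optimizer` and `optimizer_trace_fallback` are settings inherited from Greenplum; the exact wording of the fallback message is not confirmed here.

   ```sql
   -- Assumed schemas; only the table names and column v come from the example above.
   CREATE TABLE tbl (id int, v text) DISTRIBUTED BY (id);
   CREATE TABLE tbl_collate_c (id int, v text COLLATE "C") DISTRIBUTED BY (id);

   -- Ask for ORCA and log the reason whenever it falls back to the Postgres planner.
   SET optimizer = on;
   SET optimizer_trace_fallback = on;

   EXPLAIN SELECT * FROM tbl ORDER BY v;
   EXPLAIN SELECT * FROM tbl_collate_c ORDER BY v;
   ```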
   
   ## Proposed Solution
   
   1. **Fix ORCA to support `COLLATE "C"`** (prerequisite: #717): allow ORCA to generate plans for tables/columns with C collation instead of falling back
   2. **Enable C locale support in ORCA's sort/comparison operators** — ensure 
ORCA can correctly classify C collation as a supported collation for sort keys, 
merge keys, and equality comparisons
   3. **TPC-DS test setup**: consider recommending `LC_COLLATE=C` or `COLLATE 
"C"` for TPC-DS benchmark tables to unlock ORCA's full optimization potential
   
   ## Expected Benefit
   
   - Full ORCA plan coverage for TPC-DS string columns without fallback
   - Potentially 50–100%+ speedup on string-heavy sort/scan/join operations, extrapolating from the PostgreSQL benchmark cited above
   - More accurate and competitive TPC-DS benchmark results for Cloudberry
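
   The TPC-DS setup in item 3 of the proposed solution could be sketched in two ways: a database-wide C locale, or per-column `COLLATE "C"`. The database name `tpcds_c` is hypothetical and the `item` table is abbreviated to three columns for illustration. Once #717 is addressed, the final `EXPLAIN` would be expected to report GPORCA rather than the Postgres optimizer.

   ```sql
   -- Option A: database-wide C locale (database name is hypothetical).
   -- template0 is required when the locale differs from the template database's.
   CREATE DATABASE tpcds_c
       TEMPLATE template0
       ENCODING 'UTF8'
       LC_COLLATE 'C'
       LC_CTYPE 'C';

   -- Option B: per-column C collation (abbreviated TPC-DS item table).
   CREATE TABLE item (
       i_item_sk   integer NOT NULL,
       i_item_id   char(16) COLLATE "C" NOT NULL,
       i_item_desc varchar(200) COLLATE "C"
   ) DISTRIBUTED BY (i_item_sk);

   -- Check which optimizer produced the plan for a string sort.
   EXPLAIN SELECT i_item_id, i_item_desc FROM item ORDER BY i_item_desc;
   ```

   To see which locale existing databases use, `SELECT datname, datcollate, datctype FROM pg_database;` lists the collation and ctype settings per database.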
   
   ## Related
   
   - Depends on #717 (ORCA fallbacks for collate "C")
   - Reference: 
https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

