ziyanTOP commented on PR #4413: URL: https://github.com/apache/flink-cdc/pull/4413#issuecomment-4571114699
@lvyanquan Thanks for the review and the great suggestion. I agree that automatically detecting the collation is the ideal long-term experience. However, a few practical constraints make hard-coded auto-detection difficult in this PR:

1. **Scope mismatch**: `SHOW VARIABLES LIKE 'collation_server'` only returns the **server-level** default. In MySQL, collation can be overridden at the database, table, and even **column** level. The chunk-split logic needs the collation of the specific chunk-key column, not the server default.
2. **Multi-table overhead**: A Pipeline job often captures dozens of tables. To auto-detect correctly, we would need to query `information_schema.COLUMNS` for every table during snapshot initialization, map each MySQL collation name to a Java comparison strategy, and handle mixed collations for composite primary keys. This adds non-trivial startup latency and state complexity.
3. **User override**: Some users may want to force specific comparison semantics regardless of the MySQL collation (e.g., for performance tuning or cross-version compatibility).

Therefore, the current explicit configuration is the safest and most backward-compatible fix.

That said, I think adding an `auto` mode is a valuable **follow-up enhancement**: we can automatically detect each table's chunk-key column collation at snapshot time, persist the per-table compare mode in the split state, and fall back to `default` when detection fails. I'd be happy to create a separate Jira ticket and PR for that.

Please let me know if the current approach looks acceptable.
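
To make the follow-up idea more concrete, here is a rough sketch of how an `auto` mode might map a chunk-key column's collation to a compare mode, with a fallback to `default`. Everything in it is an assumption for illustration: `CollationCompareModes`, `CompareMode`, and `resolve` are hypothetical names that do not exist in flink-cdc or in this PR, and the suffix-based mapping is a simplification of MySQL's collation naming convention (`_bin`, `_ci`, `_cs`).

```java
import java.util.Locale;

/** Hypothetical sketch only; none of these names exist in flink-cdc. */
final class CollationCompareModes {

    /** Hypothetical compare modes; DEFAULT stands for the connector's existing behavior. */
    enum CompareMode { DEFAULT, BINARY, CASE_INSENSITIVE }

    /**
     * Illustrative per-column lookup. COLLATION_NAME is NULL for
     * non-character columns, so callers must expect a null result.
     */
    static final String COLUMN_COLLATION_QUERY =
            "SELECT COLLATION_NAME FROM information_schema.COLUMNS "
                    + "WHERE TABLE_SCHEMA = ? AND TABLE_NAME = ? AND COLUMN_NAME = ?";

    private CollationCompareModes() {}

    /** Maps a MySQL collation name to a compare mode, falling back to DEFAULT. */
    static CompareMode resolve(String collationName) {
        if (collationName == null || collationName.trim().isEmpty()) {
            // Non-character column, or detection failed: keep existing behavior.
            return CompareMode.DEFAULT;
        }
        String name = collationName.trim().toLowerCase(Locale.ROOT);
        if (name.equals("binary") || name.endsWith("_bin")) {
            return CompareMode.BINARY;
        }
        if (name.endsWith("_ci")) {
            return CompareMode.CASE_INSENSITIVE;
        }
        // Anything else (for example "_cs") is left to the default mode in this sketch.
        return CompareMode.DEFAULT;
    }
}
```

A real implementation would need more care than this: accent sensitivity, pad attributes, and the difference between MySQL's ordering and Java's `String` ordering are all ignored here, and a composite chunk key would need one lookup per column.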
