2416210017 opened a new issue, #5432: URL: https://github.com/apache/seatunnel/issues/5432
### Search before asking

- [X] I had searched in the [feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.

### Description

Some optimization suggestions for the JDBC source's support for string types as partition keys: https://github.com/apache/seatunnel/pull/4947

Although the implementation supports string types as partition keys, the design is not very reasonable. The MD5 hash function is applied to each value in the `table_name` column, the resulting hash value is taken modulo 10, and then the absolute value is taken. Each partition selects only the rows whose result equals its partition number (for example, 1 for partition 1).

For example, when the specified number of partitions is 10, the actual SQL executed in the business database is:

partition 1: `SELECT * FROM ( select * from metastore_bdc.collect_dct_table_info ) tt where ABS(MD5(table_name) % 10) = 1;`

partition 2: `SELECT * FROM ( select * from metastore_bdc.collect_dct_table_info ) tt where ABS(MD5(table_name) % 10) = 2;`

...

As shown in the figure, this type of query scans the entire table in the business database and cannot use index keys, so partitioning brings no performance improvement.

Suggested reference: Sqoop's method of splitting on strings, which converts the Unicode characters of the key into numbers.
Reference link: https://blog.csdn.net/fyhailin/article/details/79069475

I have a suggestion for the sharding rules (for indexed and non-indexed columns):

1. Fixed step over field values: use the minimum and maximum values of the field together with the configured number of shards to determine the step size. Advantage: fast sharding. Disadvantage: data skew.
2. Fixed number of records: starting from the minimum value, sort and take a fixed number of records, compute the corresponding end position, and use that end position as the start position for the next shard, iterating until all sharding rules are obtained. Advantage: absolutely balanced data. Disadvantage: slow shard computation.

Supplementary note: use scheme 1 when the column has no index, and scheme 2 when it has one (an index speeds up the shard computation).

### Usage Scenario

Optimize the JDBC source's support for string types as partition keys.

### Related issues

[Feature][Connector-V2] JDBC source support string type as partition key: https://github.com/apache/seatunnel/pull/4947

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
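
For illustration, scheme 1 from the Description (a fixed step between the minimum and maximum key) could look roughly like the sketch below. It follows the general idea of Sqoop-style string splitting but is not Sqoop's or SeaTunnel's actual implementation. The names `BASE`, `MAX_CHARS`, `string_to_number`, `number_to_string`, `split_points`, and `range_predicates` are hypothetical, and the sketch assumes ASCII keys compared in binary (code point) order, which a real database collation may not match.

```python
from fractions import Fraction

BASE = 128       # assumption: keys contain only ASCII characters
MAX_CHARS = 8    # assumption: only the first 8 characters influence a boundary


def string_to_number(s):
    """Read the first MAX_CHARS characters as digits of a base-BASE fraction in [0, 1)."""
    value = Fraction(0)
    for i, ch in enumerate(s[:MAX_CHARS]):
        if ord(ch) >= BASE:
            raise ValueError(f"non-ASCII character {ch!r} is outside this sketch")
        value += Fraction(ord(ch), BASE ** (i + 1))
    return value


def number_to_string(value):
    """Inverse mapping: emit one character per base-BASE digit, at most MAX_CHARS."""
    chars = []
    for _ in range(MAX_CHARS):
        if value == 0:
            break
        value *= BASE
        digit = int(value)       # integer part is the next character code
        chars.append(chr(digit))
        value -= digit
    return "".join(chars)


def split_points(min_key, max_key, num_shards):
    """Return num_shards + 1 boundaries, evenly spaced in the numeric mapping."""
    lo, hi = string_to_number(min_key), string_to_number(max_key)
    step = (hi - lo) / num_shards
    inner = [number_to_string(lo + step * i) for i in range(1, num_shards)]
    return [min_key] + inner + [max_key]


def range_predicates(column, points):
    """Build parameterized range predicates; the last shard includes the maximum."""
    shards = []
    for i in range(len(points) - 1):
        upper_op = "<=" if i == len(points) - 2 else "<"
        sql = f"{column} >= ? AND {column} {upper_op} ?"
        shards.append((sql, (points[i], points[i + 1])))
    return shards


points = split_points("a", "z", 5)
for sql, params in range_predicates("table_name", points):
    print(sql, params)
```

Each predicate is a plain range condition on the key column, so the database can use an index on that column if one exists, unlike the `ABS(MD5(table_name) % 10)` filter. Evenly spaced boundaries do not imply evenly sized shards, which is the data skew noted for scheme 1.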
