[I] [Feature][Connector-V2] Some optimization suggestions for JDBC source support for string types as partitioning keys [seatunnel]

via GitHub Mon, 25 Dec 2023 18:14:09 -0800


2416210017 opened a new issue, #5432:
URL: https://github.com/apache/seatunnel/issues/5432


   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   Some optimization suggestions for JDBC source support for string types as 
partitioning keys: https://github.com/apache/seatunnel/pull/4947
   
   Although the implementation supports string types as partitioning keys, the 
design is not very reasonable. It applies the MD5 hash function to each value 
in the table_name column, takes the hash value modulo 10, and then takes the 
absolute value. Each partition selects only the rows whose result equals its 
partition number (for example, 1).
   
   For example, with the number of partitions set to 10, the SQL actually 
executed in the business database is:
   
   partition 1:
   SELECT * FROM (
        select * from metastore_bdc.collect_dct_table_info
   ) tt where ABS(MD5(table_name) % 10) = 1;
   
   partition 2:
   SELECT * FROM (
        select * from metastore_bdc.collect_dct_table_info
   ) tt where ABS(MD5(table_name) % 10) = 2;
   ...
   
   
![a9a1d7da54004dbcb5a088fd91879b9](https://github.com/apache/seatunnel/assets/52597892/26d8c0bb-0322-4f72-91b8-2019464b4571)
   
   As shown in the figure, this type of query scans the entire table in the 
business database and cannot use an index on the partition key, so 
partitioning brings no performance improvement.
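
The effect described above can be reproduced on a small scale. The following is a minimal sketch using SQLite from the Python standard library, not the database from the issue: the table `info`, the index `idx_table_name`, and the helper `md5_mod` are hypothetical names introduced only for illustration, and `md5_mod` stands in for the `ABS(MD5(table_name) % 10)` expression. It compares the query plan of a hash-based predicate with that of a range predicate on the same indexed column.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE info (table_name TEXT, payload TEXT)")
conn.execute("CREATE INDEX idx_table_name ON info (table_name)")
conn.executemany(
    "INSERT INTO info VALUES (?, ?)",
    [(f"tbl_{i:04d}", "x") for i in range(1000)],
)

def md5_mod(value, buckets):
    # Hypothetical stand-in for the hash expression in the issue:
    # the MD5 digest read as an integer, reduced modulo `buckets`.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % buckets

conn.create_function("md5_mod", 2, md5_mod)

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable detail string.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)

# Hash predicate: the function must be evaluated for every row.
hash_plan = plan("SELECT * FROM info WHERE md5_mod(table_name, 10) = 1")

# Range predicate: the bounds can be looked up in the index.
range_plan = plan(
    "SELECT * FROM info "
    "WHERE table_name >= 'tbl_0100' AND table_name < 'tbl_0200'"
)

print(hash_plan)
print(range_plan)
```

In SQLite the first plan is reported as a scan and the second as a search using the index. Other databases word their plans differently, but the underlying reason is the same: a predicate that wraps the column in a hash function cannot be answered from an ordinary index on that column.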
   Suggested reference: Sqoop's method of splitting on string columns, which 
converts the Unicode characters of the key into numbers.
   
   Reference link: https://blog.csdn.net/fyhailin/article/details/79069475
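
As a rough illustration of what digitizing the characters could look like, the sketch below reads a string as a fraction whose digits are its Unicode code points, splits the numeric interval evenly, and converts the split points back into strings. It is only an approximation of the idea, not Sqoop's or SeaTunnel's code; `BASE`, `MAX_CHARS`, and all function names are assumptions made for this example.

```python
import os
from fractions import Fraction

BASE = 0x110000  # assumption: one "digit" per possible Unicode code point
MAX_CHARS = 8    # assumption: only the first characters contribute

def string_to_number(s):
    """Read s as the base-BASE fraction 0.c1c2c3... in [0, 1)."""
    value, place = Fraction(0), Fraction(1, BASE)
    for ch in s[:MAX_CHARS]:
        value += ord(ch) * place
        place /= BASE
    return value

def number_to_string(x):
    """Inverse of string_to_number, truncated to MAX_CHARS characters."""
    chars = []
    while x > 0 and len(chars) < MAX_CHARS:
        x *= BASE
        digit = int(x)  # integer part is the next code point
        chars.append(chr(digit))
        x -= digit
    return "".join(chars)

def split_points(low, high, num_splits):
    """Return num_splits + 1 boundary strings between low and high."""
    # Strip the shared prefix so the differing characters carry the precision.
    prefix = os.path.commonprefix([low, high])
    lo = string_to_number(low[len(prefix):])
    hi = string_to_number(high[len(prefix):])
    step = (hi - lo) / num_splits
    inner = [prefix + number_to_string(lo + step * i)
             for i in range(1, num_splits)]
    return [low] + inner + [high]

# Note: computed boundaries are arbitrary code points and need not be
# printable characters; the ASCII keys below keep the example readable.
print(split_points("user_a", "user_z", 5))
```

For the keys `user_a` and `user_z` and five splits, the boundaries are `user_a`, `user_f`, `user_k`, `user_p`, `user_u`, `user_z`, and each partition can then read a range such as `table_name >= 'user_a' AND table_name < 'user_f'`. The ranges are only correct if the database compares strings in code point order, so the column's collation would have to be checked before relying on this.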
   
   
   I have a suggestion:
   Sharding rules (for indexed and non-indexed fields)
   1. Fixed step over the field values: the step size is computed from the 
field's maximum and minimum values and the configured number of shards. 
Advantage: fast shard computation. Disadvantage: data skew.
   
   2. Fixed number of records: starting from the minimum value, sort and take 
a fixed number of records, compute the corresponding end position, and use 
that end position as the start of the next shard, iterating until all shard 
boundaries are obtained.
   Advantage: evenly balanced data. Disadvantage: slow shard computation.
   
   Supplementary explanation:
   Use Scheme 1 when the field has no index.
   Use Scheme 2 when the field has an index (the index speeds up shard 
computation).
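
The two schemes above might be sketched as follows. This is an illustration under stated assumptions rather than a proposed implementation: it uses SQLite from the Python standard library, the names `fixed_step_ranges` and `fixed_count_boundaries` and the table `info` are hypothetical, Scheme 1 is shown on integer values for brevity, and Scheme 2 assumes that key values are unique and that the table and column names come from trusted configuration.

```python
import sqlite3

def fixed_step_ranges(min_val, max_val, num_shards):
    """Scheme 1: equal-width half-open ranges [start, end) over integers."""
    span = max_val - min_val + 1
    step = -(-span // num_shards)  # ceiling division
    ranges = []
    start = min_val
    while start <= max_val:
        end = min(start + step, max_val + 1)
        ranges.append((start, end))
        start = end
    return ranges

def fixed_count_boundaries(conn, table, column, rows_per_shard):
    """Scheme 2: walk the sorted key, recording the start of each shard.

    Shard i covers [boundaries[i], boundaries[i + 1]); the last shard is
    open-ended. Assumes unique key values, otherwise the walk can stall.
    """
    boundaries = []
    start = conn.execute(f"SELECT MIN({column}) FROM {table}").fetchone()[0]
    while start is not None:
        boundaries.append(start)
        # The key rows_per_shard positions after `start` begins the next shard.
        row = conn.execute(
            f"SELECT {column} FROM {table} WHERE {column} >= ? "
            f"ORDER BY {column} LIMIT 1 OFFSET ?",
            (start, rows_per_shard),
        ).fetchone()
        start = row[0] if row else None
    return boundaries

# Small demonstration with ten string keys, k00 .. k09.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE info (table_name TEXT)")
conn.execute("CREATE INDEX idx_table_name ON info (table_name)")
conn.executemany("INSERT INTO info VALUES (?)",
                 [(f"k{i:02d}",) for i in range(10)])

ranges = fixed_step_ranges(1, 10, 3)
boundaries = fixed_count_boundaries(conn, "info", "table_name", 4)
print(ranges)
print(boundaries)
```

Scheme 1 needs only the minimum and maximum, which is why it is fast but can leave some ranges nearly empty. Scheme 2 issues one ordered query per shard, so every shard except the last holds the same number of rows, and an index on the key lets each of those queries start from the previous boundary without sorting the whole table.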
   
   ### Usage Scenario
   
   Optimize the JDBC source's support for string types as partition keys.
   
   ### Related issues
   
   [Feature][Connector-V2] JDBC source support string type as partition key:
   https://github.com/apache/seatunnel/pull/4947
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
