ic4y opened a new pull request, #4856:
URL: https://github.com/apache/seatunnel/pull/4856

   ## Purpose of this pull request
   
   This commit introduces a new sharding strategy based on data sampling. The 
strategy is invoked when the distribution factor of the data falls outside the 
configured upper and lower bounds and the estimated shard count exceeds a 
configured threshold (1000 shards by default).
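
The trigger condition described above could be sketched roughly as follows. This is an illustration only, not the code in this pull request: the class name, the enum values, and the inclusive bounds check are assumptions, and the names of the two fallback strategies are hypothetical.

```java
// Hypothetical sketch of the strategy selection; not the actual SeaTunnel code.
final class ShardingStrategyChooser {

    enum Strategy { EVEN_RANGE, SAMPLE_BASED, DEFAULT_UNEVEN }

    /**
     * Picks a strategy from the distribution factor and the estimated shard count.
     * Assumption: the bounds are inclusive and the threshold comparison is strict.
     */
    static Strategy choose(
            double distributionFactor,
            double lowerBound,
            double upperBound,
            long estimatedShardCount,
            long sampleShardingThreshold) {
        boolean withinBounds =
                distributionFactor >= lowerBound && distributionFactor <= upperBound;
        if (withinBounds) {
            // Data looks evenly distributed, so sampling is not needed.
            return Strategy.EVEN_RANGE;
        }
        if (estimatedShardCount > sampleShardingThreshold) {
            // Skewed data and many shards: use the sample-based strategy.
            return Strategy.SAMPLE_BASED;
        }
        // Skewed data but few shards: keep the pre-existing behaviour.
        return Strategy.DEFAULT_UNEVEN;
    }
}
```

With a threshold of 1000, a skewed column estimated at 5000 shards would select `SAMPLE_BASED`, while the same column estimated at 200 shards would not.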
   
   The sampling rate is configurable via the INVERSE_SAMPLING_RATE parameter. 
For example, a value of 1000 means a 1/1000 sampling rate is applied.
   
   When both conditions hold, the system samples the data and creates shards 
based on the sampled values. This approach can handle large datasets more 
efficiently and can potentially reduce query execution time and resource usage.
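
To make the idea concrete, here is a minimal in-memory sketch of sampling at an inverse rate and deriving shard boundaries from the sample. It is not the `sampleDataFromColumn` implementation from this pull request: the class and method names are hypothetical, the column is modelled as a `List<Long>` rather than read from a database, and picking evenly spaced values from the sorted sample as boundaries is an assumption about how shards are formed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; the real connector reads the column through JDBC.
final class SampleShardingSketch {

    /** Keeps every inverseSamplingRate-th value (1000 means a 1/1000 rate), sorted. */
    static List<Long> sample(List<Long> columnValues, int inverseSamplingRate) {
        if (inverseSamplingRate <= 0) {
            throw new IllegalArgumentException("inverseSamplingRate must be positive");
        }
        List<Long> sampled = new ArrayList<>();
        for (int i = 0; i < columnValues.size(); i += inverseSamplingRate) {
            sampled.add(columnValues.get(i));
        }
        Collections.sort(sampled);
        return sampled;
    }

    /** Picks up to shardCount - 1 split points at evenly spaced positions in the sample. */
    static List<Long> boundaries(List<Long> sortedSample, int shardCount) {
        List<Long> result = new ArrayList<>();
        if (sortedSample.isEmpty() || shardCount <= 1) {
            return result;
        }
        for (int k = 1; k < shardCount; k++) {
            int idx = (int) ((long) k * sortedSample.size() / shardCount);
            long boundary = sortedSample.get(idx);
            // Skip duplicates so that no shard has identical start and end points.
            if (result.isEmpty() || result.get(result.size() - 1) != boundary) {
                result.add(boundary);
            }
        }
        return result;
    }
}
```

Because the split points come from the sample rather than from one boundary query per shard, the number of values inspected shrinks by roughly the inverse sampling rate, at the cost of shard sizes that are only approximately equal.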
   
   Changes include:
   
   * Added new SAMPLE_SHARDING_THRESHOLD configuration option with a default 
value of 1000 shards.
   * Added new INVERSE_SAMPLING_RATE configuration option to allow users to 
specify the sampling rate for this sharding strategy.
   * Implemented sampleDataFromColumn method for performing data sampling and 
shard creation.
   * Updated the sharding logic to invoke the sample-based strategy when 
conditions are met.
   * Updated relevant documentation and comments to reflect these changes.
   
   This feature enhances the flexibility of our sharding strategies, especially 
for handling large datasets.
   
   ## Check list
   
   * [ ] Code changes are covered with tests, or do not need tests for this 
reason:
   * [ ] If your PR adds any new Jar binary package, please add a License 
Notice according to the
     [New License 
Guide](https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md)
   * [ ] If necessary, please update the documentation to describe the new 
feature. https://github.com/apache/seatunnel/tree/dev/docs
   * [ ] If you are contributing the connector code, please check that the 
following files are updated:
     1. Update the change log in the connector document. For more details, 
refer to 
[connector-v2](https://github.com/apache/seatunnel/tree/dev/docs/en/connector-v2)
     2. Update 
[plugin-mapping.properties](https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties)
 and add new connector information in it
     3. Update the pom file of 
[seatunnel-dist](https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml)
   * [ ] Update the 
[`release-note`](https://github.com/apache/seatunnel/blob/dev/release-note.md).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
