[GitHub] [seatunnel] ic4y opened a new pull request, #4856: [improve][CDC base] Implement Sample-based Sharding Strategy with Configurable Sampling Rate

via GitHub Mon, 29 May 2023 10:06:49 -0700


ic4y opened a new pull request, #4856:
URL: https://github.com/apache/seatunnel/pull/4856


## Purpose of this pull request

This commit introduces a new sharding strategy based on data sampling. This
strategy is invoked when the distribution factor of the data falls outside the
specified upper and lower bounds, and the estimated shard count exceeds a
specified threshold (default is 1000 shards).
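
A minimal Java sketch of the condition described above may help reviewers. The class name, method names, and the way the shard count is estimated (approximate row count divided by chunk size, rounded up) are assumptions made for illustration; they are not taken from the SeaTunnel code in this pull request.

```java
// Illustrative sketch only: all names here are hypothetical, and the
// shard-count estimate is an assumption, not the SeaTunnel implementation.
final class SampleShardingDecision {
    private SampleShardingDecision() {}

    // Assumed estimate: one shard per chunkSize rows, rounded up.
    static long estimateShardCount(long approximateRowCount, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        return (approximateRowCount + chunkSize - 1) / chunkSize;
    }

    // Sampling is used only when BOTH conditions from the description hold:
    // the distribution factor is outside [lowerBound, upperBound], and the
    // estimated shard count strictly exceeds the threshold.
    static boolean shouldUseSampleSharding(
            double distributionFactor,
            double lowerBound,
            double upperBound,
            long estimatedShardCount,
            long sampleShardingThreshold) {
        boolean outsideBounds =
                distributionFactor < lowerBound || distributionFactor > upperBound;
        return outsideBounds && estimatedShardCount > sampleShardingThreshold;
    }
}
```

In this sketch, a table whose estimated shard count is exactly equal to the threshold does not trigger sampling, since the description says the count must exceed it.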

The sampling rate is configurable via the INVERSE_SAMPLING_RATE parameter.
For example, a value of 1000 means a 1/1000 sampling rate is applied.
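
To make the meaning of an inverse sampling rate concrete, here is a small hedged sketch that keeps one value out of every `inverseSamplingRate` values. The class and method names are hypothetical, and taking every N-th element is an assumed sampling scheme chosen for illustration; the pull request may sample differently (for example, in SQL).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical names, assumed every-N-th sampling.
final class InverseRateSampler {
    private InverseRateSampler() {}

    // Keeps the values at indexes 0, N, 2N, ... where N = inverseSamplingRate,
    // so roughly 1/N of the input is retained.
    static <T> List<T> sample(List<T> values, int inverseSamplingRate) {
        if (inverseSamplingRate < 1) {
            throw new IllegalArgumentException("inverseSamplingRate must be >= 1");
        }
        List<T> sampled = new ArrayList<>();
        for (int i = 0; i < values.size(); i += inverseSamplingRate) {
            sampled.add(values.get(i));
        }
        return sampled;
    }
}
```

With 10,000 input values and an inverse sampling rate of 1000, this sketch retains 10 values.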

When these conditions are met, the system samples the data and creates shards
from the sampled values. This approach can handle large datasets more
efficiently and may reduce query execution time and resource usage.
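
One way shards could be derived from a sample is to pick evenly spaced values from the sorted sample as split points. The sketch below assumes a sorted sample of `Long` keys and a target shard count; the names and the quantile-style selection are illustrative assumptions, not the logic of `sampleDataFromColumn` in this pull request.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical names, assumed quantile-style splitting.
final class SampleSplitPoints {
    private SampleSplitPoints() {}

    // Returns up to (shardCount - 1) ascending split points taken at evenly
    // spaced positions in the sorted sample. Repeated values are skipped so
    // that no shard has identical start and end boundaries.
    static List<Long> fromSortedSample(List<Long> sortedSample, int shardCount) {
        if (shardCount < 1) {
            throw new IllegalArgumentException("shardCount must be >= 1");
        }
        List<Long> points = new ArrayList<>();
        if (sortedSample.isEmpty()) {
            return points;
        }
        for (int i = 1; i < shardCount; i++) {
            int index = (int) ((long) i * sortedSample.size() / shardCount);
            Long candidate = sortedSample.get(index);
            if (points.isEmpty() || !points.get(points.size() - 1).equals(candidate)) {
                points.add(candidate);
            }
        }
        return points;
    }
}
```

Each pair of adjacent split points would then bound one shard, with the first and last shards left open-ended so that values missed by the sample are still covered.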

Changes include:

* Added a new `SAMPLE_SHARDING_THRESHOLD` configuration option with a default
  value of 1000 shards.
* Added a new `INVERSE_SAMPLING_RATE` configuration option to allow users to
  specify the sampling rate for this sharding strategy.
* Implemented the `sampleDataFromColumn` method for performing data sampling
  and shard creation.
* Updated the sharding logic to invoke the sample-based strategy when its
  conditions are met.
* Updated relevant documentation and comments to reflect these changes.

This feature enhances the flexibility of our sharding strategies, especially
for handling large datasets.

## Check list

* [ ] Code changes are covered with tests, or do not need tests for the
  following reason:
* [ ] If your PR adds any new Jar binary package, please add a License Notice
  according to the
  [New License Guide](https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md).
* [ ] If necessary, please update the documentation to describe the new
  feature. https://github.com/apache/seatunnel/tree/dev/docs
* [ ] If you are contributing connector code, please check that the
  following files are updated:
  1. Update the change log in the connector document. For more details, refer
     to [connector-v2](https://github.com/apache/seatunnel/tree/dev/docs/en/connector-v2).
  2. Update
     [plugin-mapping.properties](https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties)
     and add the new connector information to it.
  3. Update the pom file of
     [seatunnel-dist](https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml).
* [ ] Update the
  [`release-note`](https://github.com/apache/seatunnel/blob/dev/release-note.md).

