wsry opened a new pull request #18505:
URL: https://github.com/apache/flink/pull/18505
## What is the purpose of the change
Currently, the sort-shuffle result partition incurs extra record-copy
overhead, introduced by clustering records by subpartition index. For small
records, this overhead can cause a performance regression of as much as 20%.
This ticket aims to solve the problem.
In fact, the hash-based implementation is a natural way to achieve the goal
of sorting records by partition index. However, it has some serious
weaknesses. For example, when there are not enough buffers or there is data
skew, it can waste buffers and hurt compression efficiency, which can cause
performance regression.
This ticket tries to solve the issue by dynamically switching between the
two implementations: if there are enough buffers, the hash-based
implementation is used; otherwise, the sort-based implementation is used.
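
The switching rule described above can be illustrated with a minimal sketch.
This is not the code from this pull request: the class `ShuffleModeSketch`,
the enum `Mode`, the method `choose`, and the threshold "at least one buffer
per subpartition" are all hypothetical names and assumptions, chosen only to
show the shape of the decision.

```java
// Illustrative sketch only; names and threshold are assumptions, not Flink APIs.
class ShuffleModeSketch {

    /** The two ways of clustering records by subpartition index. */
    enum Mode { HASH_BASED, SORT_BASED }

    /**
     * Hypothetical rule: the hash-based path keeps one open buffer per
     * subpartition, so it is chosen only when at least that many buffers
     * are available. The actual threshold used by the change may differ.
     */
    static Mode choose(int availableBuffers, int numSubpartitions) {
        if (availableBuffers < 0 || numSubpartitions <= 0) {
            throw new IllegalArgumentException("invalid buffer or subpartition count");
        }
        return availableBuffers >= numSubpartitions ? Mode.HASH_BASED : Mode.SORT_BASED;
    }

    public static void main(String[] args) {
        System.out.println(choose(512, 100)); // prints HASH_BASED
        System.out.println(choose(64, 100));  // prints SORT_BASED
    }
}
```

Under this assumed rule, a partition with few buffers relative to its number
of subpartitions falls back to the sort-based path, which avoids holding many
partially filled buffers at the cost of the extra record copy.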
## Brief change log
- Dynamically switch between the two implementations: if there are enough
buffers, the hash-based implementation is used; otherwise, the sort-based
implementation is used.
## Verifying this change
This change adds tests, and existing tests also help to verify the
change.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (yes / **no**)
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: (yes / **no**)
- The serializers: (yes / **no** / don't know)
- The runtime per-record code paths (performance sensitive): (yes / **no**
/ don't know)
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't
know)
- The S3 file system connector: (yes / **no** / don't know)
## Documentation
- Does this pull request introduce a new feature? (yes / **no**)
- If yes, how is the feature documented? (**not applicable** / docs /
JavaDocs / not documented)