zuston opened a new pull request, #1652: URL: https://github.com/apache/incubator-uniffle/pull/1652
### What changes were proposed in this pull request?

1. Make the write client always use the latest assignment for subsequent writes once a block reassignment happens.
2. Support reassigning multiple servers at once for huge partitions, so that load balancing speeds up writing.
3. Support multiple retries for partition reassignment.

#### Always using the latest assignment

To achieve this, I introduce `ShuffleHandleInfoWrapper` to get the latest assignment for the current task. The creation of `AddBlockEvent` also applies the latest assignment through `ShuffleHandleInfoWrapper`. The wrapper is updated by the `triggerReassignShuffleServer` RPC. That means the original reassign RPC response is refactored and replaced by the whole latest `shuffleHandleInfo`.

#### Load balance for huge partition

A huge partition is recognized by the `NO_BUFFER_FOR_HUGE_PARTITION` status code, which triggers reassignment to multiple servers. Different tasks then write the same huge partition to different servers: each task picks its server from the candidate servers based on the hash value of its taskAttemptID. This makes load balancing effective for huge partitions.

### Why are the changes needed?

This PR is a subtask of #1608.

Building on #1615 / #1610 / #1609, we already reassign servers when the write client encounters a failed or unhealthy server. But this is not good enough, because the faulty server state is not shared with unstarted tasks or with `AddBlockEvent`.

Besides, without this PR, the writing speed of a huge partition is limited to avoid affecting other normal partitions. With this PR, we can recognize this case and reassign more servers for huge partitions.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Unit and integration tests. The integration tests are as follows:

1. `PartitionBlockDataReassignBasicTest` validates the basic reassign mechanism.
2. `PartitionBlockDataReassignLoadBalanceTest` tests the load-balancing reassign mechanism for huge partitions.
3. `PartitionBlockDataReassignMultiTimesTest` tests the reassign mechanism with multiple retries.
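
For reviewers, the load-balancing rule for huge partitions can be sketched roughly as below. This is only an illustration and not the code in this PR: the class `HugePartitionServerPicker`, the method `pick`, and the use of plain strings for servers are all hypothetical. The only part taken from the description is the idea of hashing the taskAttemptID to choose one server among the reassigned candidates.

```java
import java.util.List;

// Hypothetical helper, not part of Uniffle: illustrates hash-based server selection.
final class HugePartitionServerPicker {
    private HugePartitionServerPicker() {}

    // Picks one candidate server for a task, based on the hash of its taskAttemptID.
    // The same task attempt always gets the same server, while different
    // task attempts are spread across the candidates.
    static String pick(List<String> candidates, long taskAttemptId) {
        if (candidates == null || candidates.isEmpty()) {
            throw new IllegalArgumentException("no candidate servers");
        }
        // floorMod keeps the index non-negative even when the hash is negative.
        int index = Math.floorMod(Long.hashCode(taskAttemptId), candidates.size());
        return candidates.get(index);
    }
}
```

With three candidates, task attempts 0, 4 and 5 map to the first, second and third server respectively, so concurrent tasks writing the same huge partition do not all hit one server.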
