wangshengjie123 commented on code in PR #2373:
URL: https://github.com/apache/celeborn/pull/2373#discussion_r1564540946
##########
client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java:
##########
@@ -135,30 +136,37 @@ protected Compressor initialValue() {
private final ReviveManager reviveManager;
+ private final boolean dataPushFailureTrackingEnabled;
+
protected static class ReduceFileGroups {
public Map<Integer, Set<PartitionLocation>> partitionGroups;
+ public Map<String, Set<PushFailedBatch>> pushFailedBatches;
Review Comment:
I'm not sure I fully understand your point. Do you mean:
- `the data would be under replicated`: does this mean data will be lost? I
don't think it will be, because a failed batch is retried against another
PartitionLocation, which is requested through Revive with epoch + 1.
- `If yes, wouldn't it not be simply better to retry to the same (or perhaps
updated) peer ?`: Revive is designed to avoid retrying the push to the same
PartitionLocation, since a retry against the same node may fail again or
time out. We also want to ensure that every successful batch is available on
both the primary and the replica peer.
If I have misunderstood, could you explain further and provide an example?
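To make the Revive behaviour I'm describing concrete, here is a minimal sketch in Java. It is not the actual `ShuffleClientImpl` code: `ReviveSketch`, `Location`, `revive`, `pushWithRevive` and the worker names are all hypothetical, and a push failure is simulated with a set of unhealthy hosts. It only illustrates a failed batch being retried against a new location with epoch + 1 instead of the same one.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names are Celeborn's real API.
final class ReviveSketch {

    /** Simplified stand-in for PartitionLocation. */
    static final class Location {
        final int partitionId;
        final int epoch;
        final String host;

        Location(int partitionId, int epoch, String host) {
            this.partitionId = partitionId;
            this.epoch = epoch;
            this.host = host;
        }
    }

    /** Stand-in for a Revive request: same partition, epoch + 1, a host not yet tried. */
    static Location revive(Location failed, List<String> hosts, Set<String> excluded) {
        for (String host : hosts) {
            if (!excluded.contains(host)) {
                return new Location(failed.partitionId, failed.epoch + 1, host);
            }
        }
        throw new IllegalStateException(
            "no healthy host left for partition " + failed.partitionId);
    }

    /** Pushes a batch; on each (simulated) failure, revives instead of retrying the same location. */
    static Location pushWithRevive(Location start, List<String> hosts, Set<String> unhealthy) {
        Set<String> excluded = new HashSet<>();
        Location current = start;
        while (unhealthy.contains(current.host)) { // simulated push failure
            excluded.add(current.host);            // never retry the failed location
            current = revive(current, hosts, excluded);
        }
        return current; // the location where the batch finally succeeded
    }
}
```

In this sketch, with workers `worker-a`, `worker-b` and `worker-c` and only `worker-a` unhealthy, a batch that starts at epoch 0 on `worker-a` ends up on `worker-b` at epoch 1.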