sodonnel commented on PR #5293: URL: https://github.com/apache/ozone/pull/5293#issuecomment-1719495667
OK - so say a container is under-replicated. We send out 2 deletes to the 2 DNs. Then a 3rd replica appears. It was never sent the delete, but the logic now expects 3 replies and will never get the 3rd. In this case, could you argue that after some reasonable timeout the delete should be sent again? If a DN receives a delete for a block it does not have, it will reply with an OK, right? So after a timeout we try again and it cleans up.

What about this case: we have 3 replicas and send 3 deletes, then one node goes down, making the container under-replicated. We never get a reply from that node. Replication makes a 3rd copy before the replies are checked. The container was not under-replicated at the start, so I think the change here would not help.

It feels like we should be storing the DNs we sent the commands to and checking off the replies against that list, rather than expecting replies from "all the current replica DNs", because the current replicas could have changed between when the command was issued and when the replies were received.

I think there should be a timeout, after which it discards the pending commands, sends again to the current replica list, and waits for replies from those DNs.

What do you think?
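
To make the idea of "checking off the replies against what we sent" a bit more concrete, here is a rough Java sketch. It is only an illustration: `PendingDeleteTracker`, its method names, the use of plain strings as DN IDs, and the caller-supplied millisecond clock are all hypothetical and are not existing Ozone classes or APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not an existing Ozone class. Not thread-safe.
class PendingDeleteTracker {

  // DNs that still owe a reply for one delete transaction, plus the send time.
  private static final class Pending {
    final Set<String> awaiting;
    final long sentAtMs;

    Pending(Set<String> sentTo, long sentAtMs) {
      this.awaiting = new HashSet<>(sentTo);
      this.sentAtMs = sentAtMs;
    }
  }

  private final Map<Long, Pending> pending = new HashMap<>();
  private final long timeoutMs;

  PendingDeleteTracker(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  // Remember exactly which DNs the delete command was sent to.
  void recordSent(long txId, Set<String> sentTo, long nowMs) {
    pending.put(txId, new Pending(sentTo, nowMs));
  }

  // Check a reply off against the DNs we sent to. Replies from DNs that were
  // never sent the command are ignored. Returns true once every DN has replied.
  boolean onReply(long txId, String dn) {
    Pending p = pending.get(txId);
    if (p == null) {
      return false;
    }
    p.awaiting.remove(dn);
    if (p.awaiting.isEmpty()) {
      pending.remove(txId);
      return true;
    }
    return false;
  }

  // Discard transactions that have waited at least timeoutMs and return their
  // IDs, so the caller can resend them to the current replica list.
  List<Long> expire(long nowMs) {
    List<Long> expired = new ArrayList<>();
    Iterator<Map.Entry<Long, Pending>> it = pending.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<Long, Pending> e = it.next();
      if (nowMs - e.getValue().sentAtMs >= timeoutMs) {
        expired.add(e.getKey());
        it.remove();
      }
    }
    return expired;
  }
}
```

With something like this, a replica that appears after the send never blocks completion, and a DN that goes down after the send just causes the transaction to time out and be retried against whatever the replica list is at that point.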
