sodonnel commented on PR #4618: URL: https://github.com/apache/ozone/pull/4618#issuecomment-1523001506
It's worth noting here that there have been changes in the Replication Manager which mean it will no longer keep adding to the replication queue on the datanode, so problems where thousands of replication commands end up on the datanode should not happen any more. Also, the datanode now drops both replication and delete container commands if they are too old.

I believe you saw some scenarios where the DN queue had a lot of messages, resulting in a crash. Do you have a breakdown of what commands were queued and where? E.g. replication commands on the replication queue. What about the commandQueue, where commands get placed first: was it also suffering from many messages stuck on it?

I don't see any new tests in this PR. If we are going to have logic to enforce limits, we probably need a couple of simple tests to check they are working OK.
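
To make the two behaviours discussed above concrete (a size limit on the queue, and dropping commands that are too old), here is a minimal, self-contained sketch. It is only an illustration under assumptions: `BoundedCommandQueue`, `Command`, `deadlineMillis` and `maxSize` are hypothetical names, not classes or fields from Ozone or from this PR.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: none of these names come from the Ozone codebase.
final class BoundedCommandQueue {

  /** A queued command plus the time after which it should be discarded. */
  static final class Command {
    final String type;
    final long deadlineMillis;

    Command(String type, long deadlineMillis) {
      this.type = type;
      this.deadlineMillis = deadlineMillis;
    }
  }

  private final Deque<Command> queue = new ArrayDeque<>();
  private final int maxSize;

  BoundedCommandQueue(int maxSize) {
    this.maxSize = maxSize;
  }

  /** Rejects the command (returns false) when the queue is already full. */
  synchronized boolean offer(Command command) {
    if (queue.size() >= maxSize) {
      return false;
    }
    queue.addLast(command);
    return true;
  }

  /**
   * Returns the next command whose deadline has not passed, or null if none
   * remain. Commands that are already past their deadline are dropped.
   */
  synchronized Command poll(long nowMillis) {
    Command command;
    while ((command = queue.pollFirst()) != null) {
      if (command.deadlineMillis >= nowMillis) {
        return command;
      }
    }
    return null;
  }

  synchronized int size() {
    return queue.size();
  }
}
```

A simple test of this kind would fill the queue to `maxSize`, assert that the next `offer` is rejected, and then assert that `poll` skips a command whose deadline has passed. Passing the current time into `poll` rather than reading the system clock keeps such a test deterministic.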
