Github user klion26 commented on the issue: https://github.com/apache/spark/pull/19145

My colleague created an [issue](https://issues.apache.org/jira/browse/YARN-7214) for this; I'll restate the description here.

A Spark Streaming app (app1) is running on YARN, and one of app1's containers (c1) runs on NM1.

1. NM1 crashes, and the RM finds NM1 expired after 10 minutes.
2. The RM removes all of NM1's containers (RMNodeImpl), and app1 receives a completed message for c1. But the RM cannot send c1 (to be removed) to NM1, because NM1 is lost.
3. NM1 restarts and registers with the RM (with c1 in the register request), but the RM sees NM1 as lost and does not handle containers from NM1.
4. NM1 does not include c1 in its heartbeats, so c1 is never removed from NM1's context.
5. The RM restarts and NM1 re-registers with it. Now c1 is handled and recovered, and the RM sends a c1-completed message to app1's AM.

So app1 receives a duplicated completed message for c1.

For the fix:

1. I changed the code from `completedContainerIdSet.contains(containerId)` to `completedContainerIdSet.remove(containerId)` to reclaim the memory (the same container will not be reported as completed more than twice).
2. The code I added ignores the duplicated completed messages; ignoring them avoids requesting new containers unnecessarily.
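To illustrate the idea, here is a minimal standalone sketch of the dedup logic described in the fix. The class name `ContainerTracker` and method `onContainerCompleted` are hypothetical, not the actual Spark/YARN code; only the `completedContainerIdSet` name and the `contains` → `remove` change come from the comment above:

```scala
import scala.collection.mutable

// Hypothetical sketch: tracks container-completed reports and filters
// duplicates. Not the real YarnAllocator code.
class ContainerTracker {
  // IDs of containers whose completion has already been processed once.
  private val completedContainerIdSet = mutable.HashSet[String]()

  // Returns true if this completion should be processed (first report),
  // false if it is a duplicate that must be ignored, so the allocator
  // does not request a replacement container twice.
  def onContainerCompleted(containerId: String): Boolean = {
    if (completedContainerIdSet.remove(containerId)) {
      // Duplicate report: `remove` both detects that the id was already
      // seen and frees the entry, reclaiming memory. This relies on the
      // assumption that the same container is reported as completed at
      // most twice, so the id is no longer needed after this point.
      false
    } else {
      // First report: remember the id and let the caller handle it.
      completedContainerIdSet += containerId
      true
    }
  }
}
```

Using `remove` instead of `contains` gives the same duplicate check but also shrinks the set, which is why the comment says it "reclaims the memory".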