Github user klion26 commented on the issue:
https://github.com/apache/spark/pull/19145
My colleague created an
[issue](https://issues.apache.org/jira/browse/YARN-7214) for this; I'll restate
the description here.
A Spark Streaming application (app1) is running on YARN, and one of app1's
containers (c1) runs on NM1.
1. NM1 crashes, and after 10 minutes the RM marks NM1 as expired.
2. The RM removes all containers on NM1 (in RMNodeImpl), and app1 receives a
completed message for c1. But the RM cannot tell NM1 that c1 should be removed,
because NM1 is lost.
3. NM1 restarts and registers with the RM (with c1 in the register request),
but the RM still considers NM1 lost and does not handle the containers NM1
reports.
4. NM1 does not include c1 in its heartbeats afterwards, so c1 is never
removed from NM1's context.
5. The RM restarts and NM1 re-registers with it. Now c1 is handled and
recovered, and the RM sends a c1 completed message to app1's AM. So app1
receives a duplicated completed message for c1.
For the fix:
1. I changed the code from `completedContainerIdSet.contains(containerId)`
to `completedContainerIdSet.remove(containerId)` to reclaim the memory. (The
same container will not be reported as completed more than twice.)
2. The code I added ignores the duplicated completed messages; ignoring them
avoids requesting unnecessary new containers.
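To illustrate the idea behind the two points above, here is a minimal sketch (not the actual Spark `YarnAllocator` code; the class and method names are hypothetical). It shows why `remove` is preferable to `contains` for the duplicate check: `remove` detects the duplicate *and* reclaims the set entry, which is safe under the assumption that a container's completion is reported at most twice.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of deduplicating completed-container reports.
public class CompletedContainerDedup {
    // IDs of containers whose completion we have already processed once.
    private final Set<String> completedContainerIdSet = new HashSet<>();
    private int replacementsRequested = 0;

    // Handle one completed-container report from the RM. Returns true if the
    // report was processed, false if it was a duplicate and was ignored.
    public boolean onContainerCompleted(String containerId) {
        // remove() both detects the duplicate and frees the entry's memory;
        // contains() would detect it but leave the set growing forever. This
        // relies on each container being reported completed at most twice.
        if (completedContainerIdSet.remove(containerId)) {
            return false; // duplicate: ignore, do not request a replacement
        }
        completedContainerIdSet.add(containerId);
        replacementsRequested++; // first report: request a replacement container
        return true;
    }

    public int replacementsRequested() {
        return replacementsRequested;
    }
}
```

With this shape, the duplicated c1 message from step 5 is dropped and no extra container is requested, while the set entry for c1 is released on the second report.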