[jira] [Updated] (FLINK-13421) Unexpected ConcurrentModificationException when RM notify JM about allocation failure

Zhu Zhu (JIRA) Thu, 25 Jul 2019 01:05:09 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zhu Zhu updated FLINK-13421:
----------------------------
    Description: 
When a TM is lost and the RM detects it first, the RM notifies the JM through 
JobMaster#notifyAllocationFailure.

We observed an unexpected ConcurrentModificationException in this process. The 
stack trace is below:

 

Caused by: java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
        at java.util.HashMap$ValueIterator.next(HashMap.java:1466)
        at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:477)
        at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
        at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFailingAllocatedSlot(SlotPoolImpl.java:712)
        at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.failAllocation(SlotPoolImpl.java:692)
        at org.apache.flink.runtime.jobmaster.JobMaster.internalFailAllocation(JobMaster.java:538)
        at org.apache.flink.runtime.jobmaster.JobMaster.notifyAllocationFailure(JobMaster.java:664)
        ... 26 more

 

As a result, an allocated slot can be removed from SlotPool#allocatedSlots 
without all of its payload tasks being failed. Those tasks may then hang in the 
scheduled state forever, unable to fail or time out, as shown in the attached 
log.

We have not yet figured out how the concurrent modification happens. We do not 
have a debug log for the occurrence and have not been able to reproduce it with 
debug logging enabled.

However, we can change SlotSharingManager$MultiTaskSlot so that it does not 
iterate over its children directly, which would prevent the 
ConcurrentModificationException in any case.

  was:
When a TM is lost and the RM detects it first, the RM notifies the JM through 
JobMaster#notifyAllocationFailure.

We observed an unexpected ConcurrentModificationException in this process. The 
stack trace is below:

 

Caused by: java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
        at java.util.HashMap$ValueIterator.next(HashMap.java:1466)
        at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:477)
        at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
        at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFailingAllocatedSlot(SlotPoolImpl.java:712)
        at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.failAllocation(SlotPoolImpl.java:692)
        at org.apache.flink.runtime.jobmaster.JobMaster.internalFailAllocation(JobMaster.java:538)
        at org.apache.flink.runtime.jobmaster.JobMaster.notifyAllocationFailure(JobMaster.java:664)
        ... 26 more

 

As a result, an allocated slot can be removed from SlotPool#allocatedSlots 
without all of its payload tasks being failed. Those tasks may then hang in the 
scheduled state forever, as shown in the attached log.

We have not yet figured out how the concurrent modification happens. We do not 
have a debug log for the occurrence and have not been able to reproduce it with 
debug logging enabled.

However, we can change SlotSharingManager$MultiTaskSlot so that it does not 
iterate over its children directly, which would prevent the 
ConcurrentModificationException in any case.


> Unexpected ConcurrentModificationException when RM notify JM about allocation 
> failure
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-13421
>                 URL: https://issues.apache.org/jira/browse/FLINK-13421
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Zhu Zhu
>            Priority: Blocker
>
> When a TM is lost and the RM detects it first, the RM notifies the JM through 
> JobMaster#notifyAllocationFailure.
> We observed an unexpected ConcurrentModificationException in this process. The 
> stack trace is below:
>  
> Caused by: java.util.ConcurrentModificationException
>         at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
>         at java.util.HashMap$ValueIterator.next(HashMap.java:1466)
>         at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:477)
>         at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
>         at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFailingAllocatedSlot(SlotPoolImpl.java:712)
>         at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.failAllocation(SlotPoolImpl.java:692)
>         at org.apache.flink.runtime.jobmaster.JobMaster.internalFailAllocation(JobMaster.java:538)
>         at org.apache.flink.runtime.jobmaster.JobMaster.notifyAllocationFailure(JobMaster.java:664)
>         ... 26 more
>  
> As a result, an allocated slot can be removed from SlotPool#allocatedSlots 
> without all of its payload tasks being failed. Those tasks may then hang in 
> the scheduled state forever, unable to fail or time out, as shown in the 
> attached log.
> We have not yet figured out how the concurrent modification happens. We do 
> not have a debug log for the occurrence and have not been able to reproduce 
> it with debug logging enabled.
> However, we can change SlotSharingManager$MultiTaskSlot so that it does not 
> iterate over its children directly, which would prevent the 
> ConcurrentModificationException in any case.
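
The workaround proposed in the description, not iterating over the children 
directly, could look roughly like the sketch below. It is not Flink code: the 
Parent and Child classes and all method names are hypothetical stand-ins, and 
the trigger shown (a child removing itself from its parent's HashMap while the 
parent iterates over the values) is only one way such an exception can arise on 
a single thread. The actual cause in FLINK-13421 had not been identified when 
this was written. The sketch only shows that iterating over a snapshot of the 
values is immune to modifications made during the loop.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types are Flink classes.
class MultiTaskSlotSketch {

    /** Stand-in for a slot that tracks its child slots in a HashMap. */
    static class Parent {
        final Map<String, Child> children = new HashMap<>();

        void add(String id) {
            children.put(id, new Child(id, this));
        }

        /** Iterates the live values view; any removal during the loop breaks the iterator. */
        void releaseDirect() {
            for (Child child : children.values()) {
                child.release();
            }
        }

        /** Iterates a snapshot, so removals during release cannot affect the loop. */
        void releaseViaCopy() {
            List<Child> snapshot = new ArrayList<>(children.values());
            for (Child child : snapshot) {
                child.release();
            }
        }
    }

    /** Stand-in for a child slot; releasing it removes it from the parent's map (an assumption). */
    static class Child {
        final String id;
        final Parent parent;

        Child(String id, Parent parent) {
            this.id = id;
            this.parent = parent;
        }

        void release() {
            parent.children.remove(id);
        }
    }

    static Parent parentWithThreeChildren() {
        Parent parent = new Parent();
        parent.add("a");
        parent.add("b");
        parent.add("c");
        return parent;
    }

    /** Returns true if releasing over the live view fails with ConcurrentModificationException. */
    static boolean directIterationThrows() {
        try {
            parentWithThreeChildren().releaseDirect();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Returns how many children remain after releasing over a snapshot. */
    static int remainingAfterCopyRelease() {
        Parent parent = parentWithThreeChildren();
        parent.releaseViaCopy();
        return parent.children.size();
    }

    public static void main(String[] args) {
        System.out.println("direct iteration throws: " + directIterationThrows());
        System.out.println("children left after copy release: " + remainingAfterCopyRelease());
    }
}
```

The snapshot costs one extra list allocation per release, which seems a small 
price if it guarantees that every payload task is failed even when the 
children map changes during the loop.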



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
