[
https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Taras Ledkov updated IGNITE-3558:
---------------------------------
Description:
The test to reproduce:
{{IgniteCacheLockPartitionOnAffinityRunWithCollisionSpiTest.testJobFinishing}}
*Root cause*
{{GridJobExecuteResponse}} isn't sent from the target node because
{{GridJobWorker}} instances get mixed up in the {{CollisionContext}}.
*Suggestion*
The method {{GridJobProcessor.CollisionJobContext.cancel()}}
uses {{passiveJobs.remove(jobWorker.getJobId(), jobWorker)}}.
*passiveJobs* is a ConcurrentHashMap, and {{GridJobWorker.equals()}} is
implemented as a comparison of jobId.
So, when two threads try to cancel two workers with *the same jobId*, the
following can happen:
- thread0: removes jobWorker0 and cancels jobWorker0;
- thread0: puts jobWorker1 (because jobWorker0 has already been removed);
- thread1: still holds jobWorker0 and tries to cancel it;
- thread1: removes jobWorker1 instead of jobWorker0 (because workers are
identified by jobId);
- thread1: doesn't send ExecuteResponse because jobWorker0 has already been
canceled.
*Proposal*
Try to use the default (identity-based) {{Object.equals()}} for {{GridJobWorker}}.
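The scenario and the proposal above can be sketched in isolation. This is a minimal illustration, not Ignite code: {{IdEqualsWorker}}, {{IdentityWorker}} and the id "job-1" are hypothetical stand-ins for {{GridJobWorker}} and its jobId, and the two threads are replayed sequentially in the order listed above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class PassiveJobsRemoveDemo {
    /** Hypothetical stand-in for a worker whose equals() compares jobId only. */
    static final class IdEqualsWorker {
        final String jobId;

        IdEqualsWorker(String jobId) { this.jobId = jobId; }

        @Override public boolean equals(Object o) {
            return o instanceof IdEqualsWorker && jobId.equals(((IdEqualsWorker) o).jobId);
        }

        @Override public int hashCode() { return jobId.hashCode(); }
    }

    /** Hypothetical stand-in that keeps the default identity-based equals(). */
    static final class IdentityWorker {
        final String jobId;

        IdentityWorker(String jobId) { this.jobId = jobId; }
    }

    /** Returns true if a stale worker reference removes the newer worker. */
    static boolean staleRemoveWithIdEquals() {
        ConcurrentMap<String, IdEqualsWorker> passiveJobs = new ConcurrentHashMap<>();
        IdEqualsWorker w0 = new IdEqualsWorker("job-1");
        IdEqualsWorker w1 = new IdEqualsWorker("job-1");

        passiveJobs.put(w0.jobId, w0);
        passiveJobs.remove(w0.jobId, w0);        // thread0: removes (and cancels) w0
        passiveJobs.put(w1.jobId, w1);           // thread0: puts w1
        return passiveJobs.remove(w0.jobId, w0); // thread1: stale w0 matches w1 by jobId
    }

    /** Same steps, but the workers are compared by identity. */
    static boolean staleRemoveWithIdentityEquals() {
        ConcurrentMap<String, IdentityWorker> passiveJobs = new ConcurrentHashMap<>();
        IdentityWorker w0 = new IdentityWorker("job-1");
        IdentityWorker w1 = new IdentityWorker("job-1");

        passiveJobs.put(w0.jobId, w0);
        passiveJobs.remove(w0.jobId, w0);
        passiveJobs.put(w1.jobId, w1);
        return passiveJobs.remove(w0.jobId, w0); // w0 != w1, so w1 stays in the map
    }

    public static void main(String[] args) {
        System.out.println(staleRemoveWithIdEquals());       // prints "true"
        System.out.println(staleRemoveWithIdentityEquals()); // prints "false"
    }
}
```

With jobId-based equality the stale reference removes the newer worker; with identity-based equality the conditional remove fails and the newer worker stays registered.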
was:
The test to reproduce:
IgniteCacheLockPartitionOnAffinityRunWithCollisionSpiTest#testJobFinishing
*Root cause*
GridJobExecuteResponse isn't sent from the target node because GridJobWorker
instances get mixed up in the CollisionContext.
*Suggestion*
The method GridJobProcessor.CollisionJobContext.cancel()
uses passiveJobs.remove(jobWorker.getJobId(), jobWorker).
*passiveJobs* is a ConcurrentHashMap, and GridJobWorker.equals() is implemented
as a comparison of jobId.
So, when two threads try to cancel two workers with *the same jobId*, the
following can happen:
- thread0: removes jobWorker0 and cancels jobWorker0;
- thread0: puts jobWorker1 (because jobWorker0 has already been removed);
- thread1: still holds jobWorker0 and tries to cancel it;
- thread1: removes jobWorker1 instead of jobWorker0 (because workers are
identified by jobId);
- thread1: doesn't send ExecuteResponse because jobWorker0 has already been
canceled.
*Proposal*
Try to use the default (identity-based) Object.equals() for GridJobWorker.
> Affinity task hangs when Collision SPI produces a lot of job rejections &
> Failover SPI produces many attempts
> -------------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-3558
> URL: https://issues.apache.org/jira/browse/IGNITE-3558
> Project: Ignite
> Issue Type: Bug
> Components: compute
> Reporter: Taras Ledkov
> Assignee: Taras Ledkov
> Fix For: 2.0
>
> Time Spent: 3h
> Remaining Estimate: 0h
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)