[ 
https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taras Ledkov updated IGNITE-3558:
---------------------------------
    Fix Version/s: 1.7

> Affinity task hangs when Collision SPI produces a lot of job rejections & 
> Failover SPI produces many attempts
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-3558
>                 URL: https://issues.apache.org/jira/browse/IGNITE-3558
>             Project: Ignite
>          Issue Type: Bug
>          Components: compute
>            Reporter: Taras Ledkov
>            Assignee: Taras Ledkov
>             Fix For: 1.7
>
>
> The test to reproduce:
> IgniteCacheLockPartitionOnAffinityRunWithCollisionSpiTest#testJobFinishing
> *Root cause*
> GridJobExecuteResponse isn't set from target node because there is a 
> confusion with GridJobWorker instances in the CollisionContext.
> *Suggestion*
> The method GridJobProcessor.CollisionJobContext.cancel()
> use passiveJobs.remove(jobWorker.getJobId(), jobWorker). 
> *passiveJobs* is a ConcurrentHashMap and GridJobWorker.equals() implements as 
> a equation of jobId.
> So, when two thread try to cancel the two workers with *the same jobIds* we 
> have the case:
> - thread0 remove jobWorker0 & cancel jobWorker0.
> - thread0 put jobWorker1 (because jobWorker0 already removed);
> - thread1: (has a copy of jobWorker0) and try to cancel it.
> - thread1: remove jobWorker1 instead of jobWorker0 (because jobId is used to 
> identify);
> - thread1: doesn't send ExecuteResponse because jobWorker0 has been canceled.
> *Proposal*
> Try to use system default equals for the GridJobWorker



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to