[
https://issues.apache.org/jira/browse/SOLR-16412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ishan Chattopadhyaya reassigned SOLR-16412:
-------------------------------------------
Assignee: Ishan Chattopadhyaya
> Race condition could trigger error on concurrent SizeLimitedDistributedMap
> cleanup
> ----------------------------------------------------------------------------------
>
> Key: SOLR-16412
> URL: https://issues.apache.org/jira/browse/SOLR-16412
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Affects Versions: 8.8, main (10.0)
> Reporter: Patson Luk
> Assignee: Ishan Chattopadhyaya
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> h2. Description
> The exception below is observed while updating the `completedMap` field in
> `OverseerTaskProcessor`:
> {{o.a.s.c.OverseerTaskProcessor
> :org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
> NoNode for
> /overseer/collection-map-completed/mn-736f6c726d616e2d312d31383930383730393837313333303932353331}}
> {{at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)}}
> {{at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)}}
> {{at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001)}}
> {{at
> org.apache.solr.common.cloud.SolrZkClient.lambda$delete$1(SolrZkClient.java:264)}}
> {{at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)}}
> {{at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:263)}}
> {{at
> org.apache.solr.cloud.SizeLimitedDistributedMap.put(SizeLimitedDistributedMap.java:76)}}
> {{at
> org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:538)}}
> {{at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)}}
> {{at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)}}
> {{at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)}}
> h2. Cause
> Based on the stack trace, `SizeLimitedDistributedMap` had reached its size
> limit and attempted to clean up entries:
> [https://github.com/fullstorydev/lucene-solr/blob/75e89929eb360b513ee864aeb23a80c049747246/solr/core/src/java/org/apache/solr/cloud/SizeLimitedDistributedMap.java#L73-L80]
> However, the actual deletion failed with `NoNodeException`.
> This is likely caused by a race condition: multiple threads can enter the
> same code block and try to delete the same list of children, so a slower
> thread can attempt to delete a child node that no longer exists.
>
> This condition can be reproduced by a unit test, which will be included in
> the PR.
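
The interleaving described under Cause can be replayed deterministically without a ZooKeeper server. The sketch below is only an illustration, not the unit test from the PR: `PurgeRaceSketch`, `FakeNodes`, and the nested `NoNodeException` are hypothetical in-memory stand-ins, with `delete` assumed to fail on a missing node the way `ZooKeeper.delete` does in the stack trace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class PurgeRaceSketch {
    /** Hypothetical stand-in for KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /** Hypothetical in-memory stand-in for the children of the map's ZooKeeper node. */
    static class FakeNodes {
        private final Set<String> children = ConcurrentHashMap.newKeySet();

        void create(String name) {
            children.add(name);
        }

        List<String> getChildren() {
            return new ArrayList<>(children);
        }

        /** Strict delete: throws if the node is already gone. */
        void delete(String name) throws NoNodeException {
            if (!children.remove(name)) {
                throw new NoNodeException("/" + name);
            }
        }
    }

    /**
     * Replays the problematic ordering: two purgers list the same children
     * before either deletes anything, then both delete from their own list.
     * Returns true if the slower purger hits NoNodeException.
     */
    static boolean secondPurgerFails() {
        FakeNodes nodes = new FakeNodes();
        nodes.create("mn-a");
        nodes.create("mn-b");

        List<String> snapshotA = nodes.getChildren(); // purger A lists children
        List<String> snapshotB = nodes.getChildren(); // purger B sees the same list
        try {
            for (String child : snapshotA) {
                nodes.delete(child); // A removes everything first
            }
            for (String child : snapshotB) {
                nodes.delete(child); // B now deletes nodes that no longer exist
            }
            return false;
        } catch (NoNodeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second purger failed: " + secondPurgerFails());
    }
}
```

In a real deployment the ordering is decided by thread scheduling, so the failure would be intermittent; the sketch fixes the order only to make the effect visible.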
> h2. Solution
> Although we could enforce synchronization to prevent threads from purging the
> same set of child nodes, it might not be desirable to add extra blocking.
> Instead, it is probably safe for the purge operation to ignore
> `KeeperException.NoNodeException` when the node is already gone.
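
A minimal sketch of the tolerant purge proposed above, assuming an in-memory stand-in rather than the real `SolrZkClient`: `TolerantPurgeSketch`, `FakeNodes`, and the nested `NoNodeException` are hypothetical names, and this is not the code from the PR. Each purger treats a missing node as already cleaned up, so concurrent purgers need no extra locking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class TolerantPurgeSketch {
    /** Hypothetical stand-in for KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /** Hypothetical in-memory stand-in for the children of the map's ZooKeeper node. */
    static class FakeNodes {
        private final Set<String> children = ConcurrentHashMap.newKeySet();

        void create(String name) {
            children.add(name);
        }

        List<String> getChildren() {
            return new ArrayList<>(children);
        }

        /** Strict delete: throws if the node is already gone. */
        void delete(String name) throws NoNodeException {
            if (!children.remove(name)) {
                throw new NoNodeException("/" + name);
            }
        }
    }

    /** Deletes every listed child, skipping nodes another thread already removed. */
    static int purge(FakeNodes nodes) {
        int deleted = 0;
        for (String child : nodes.getChildren()) {
            try {
                nodes.delete(child);
                deleted++;
            } catch (NoNodeException ignored) {
                // The node is gone, which is the outcome the purge wanted anyway.
            }
        }
        return deleted;
    }

    /** Runs several purgers at once and returns the total number of successful deletes. */
    static int runConcurrentPurge(int threads, int nodeCount) {
        FakeNodes nodes = new FakeNodes();
        for (int i = 0; i < nodeCount; i++) {
            nodes.create("mn-" + i);
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                tasks.add(() -> purge(nodes));
            }
            int total = 0;
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("deleted: " + runConcurrentPurge(4, 100));
    }
}
```

Because each node can be removed successfully only once, the successful deletes across all purgers add up to the number of nodes, and no purger fails regardless of how the threads interleave.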