[ https://issues.apache.org/jira/browse/SOLR-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16170811#comment-16170811 ]
Erick Erickson commented on SOLR-10181:
---------------------------------------

Assigning to myself so I don't lose track of it; feel free to take it if you have a special interest.

> CREATEALIAS and DELETEALIAS commands consistency problems under concurrency
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-10181
>                 URL: https://issues.apache.org/jira/browse/SOLR-10181
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrCloud
>    Affects Versions: 5.3, 5.4, 5.5, 6.4.1
>            Reporter: Samuel García Martínez
>            Assignee: Erick Erickson
>         Attachments: SOLR-10181.patch
>
>
> When several CREATEALIAS commands are run at the same time by the OCP, it can
> happen that, even though the API response is OK, the changes made by some of
> those CREATEALIAS requests are lost.
> h3. The problem
> The problem happens because the CREATEALIAS cmd implementation relies on
> _zkStateReader.getAliases()_ to create the map that will be stored in ZK. If
> several threads reach that line at the same time, only one of the resulting
> maps will be stored correctly and the others will be overwritten.
> The code I'm referencing is [this
> piece|https://github.com/apache/lucene-solr/blob/8c1e67e30e071ceed636083532d4598bf6a8791f/solr/core/src/java/org/apache/solr/cloud/CreateAliasCmd.java#L65].
> As an example, let's say that the current aliases map is {a:colA, b:colB}.
> If two CREATEALIAS commands (one adding c:colC and the other adding d:colD)
> are submitted to the _tpe_ and reach that line at the same time, the
> resulting maps will look like {a:colA, b:colB, c:colC} and
> {a:colA, b:colB, d:colD}, and only one of them will be stored in ZK. The
> result is "data loss": the API returns OK even though one of the requests
> did not take effect.
> On top of this, another concurrency problem can happen when the command
> checks whether the alias has been set, using the _checkForAlias_ method. If
> these two CREATEALIAS zk writes have run at the same time, the alias check
> for one of the threads can time out, since only one of the writes has
> "survived" and been "committed" to the _zkStateReader.getAliases()_ map.
> h3. How to fix it
> I can post a patch for this if someone gives me directions on how it should
> be fixed. As I see it, there are two places where the issue can be fixed: in
> the processor (OverseerCollectionMessageHandler), in a generic way, or inside
> the command itself.
> h5. The processor fix
> The locking mechanism (_OverseerCollectionMessageHandler#lockTask_) should be
> the place to fix this inside the processor. I thought that adding the
> operation name to the locking key, instead of only "collection" or "name",
> would fix the issue, but I realized that the problem will happen anyway if
> the concurrency is between different operations modifying the same resource
> (as CREATEALIAS and DELETEALIAS do). So, if this is the path to follow, I
> don't know what should be used as the locking key.
> h5. The command fix
> Fixing it at the command level (_CreateAliasCmd_ and _DeleteAliasCmd_) would
> be relatively easy, using optimistic locking, i.e., passing the aliases.json
> zk version to keeper.setData. To do that, the Aliases class should expose the
> aliases version so the commands can forward that version with the update and
> retry when the write fails.
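
For illustration only, here is a minimal sketch of the optimistic-locking idea described under "The command fix". It is not the attached patch and does not use Solr or ZooKeeper classes: AliasCasSketch, Versioned, VersionedStore and createAlias are hypothetical names, and the in-memory store merely stands in for a versioned write to aliases.json, which is assumed to be rejected when the version the writer read is no longer current.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: none of these names are Solr or ZooKeeper classes.
final class AliasCasSketch {

    /** Immutable snapshot of the alias map plus the version it was read at. */
    static final class Versioned {
        final Map<String, String> aliases;
        final int version;

        Versioned(Map<String, String> aliases, int version) {
            this.aliases = Map.copyOf(aliases);
            this.version = version;
        }
    }

    /** In-memory stand-in for the versioned aliases.json node (an assumption). */
    static final class VersionedStore {
        private final AtomicReference<Versioned> current;

        VersionedStore(Map<String, String> initial) {
            current = new AtomicReference<>(new Versioned(initial, 0));
        }

        Versioned read() {
            return current.get();
        }

        /** Writes only if expectedVersion is still current; returns false for a stale write. */
        boolean setData(Map<String, String> data, int expectedVersion) {
            Versioned seen = current.get();
            if (seen.version != expectedVersion) {
                return false;
            }
            return current.compareAndSet(seen, new Versioned(data, expectedVersion + 1));
        }
    }

    /** Read-modify-write loop: re-reads and retries whenever another writer got in first. */
    static void createAlias(VersionedStore store, String alias, String collection) {
        while (true) {
            Versioned snapshot = store.read();
            Map<String, String> updated = new HashMap<>(snapshot.aliases);
            updated.put(alias, collection);
            if (store.setData(updated, snapshot.version)) {
                return;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        VersionedStore store = new VersionedStore(Map.of("a", "colA", "b", "colB"));
        Thread t1 = new Thread(() -> createAlias(store, "c", "colC"));
        Thread t2 = new Thread(() -> createAlias(store, "d", "colD"));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Both aliases survive regardless of how the two threads interleave.
        System.out.println(new TreeMap<>(store.read().aliases));
    }
}
```

In this sketch a writer that loses the race simply rebuilds its map from the fresh snapshot, so neither c:colC nor d:colD is dropped. A real command would presumably bound the number of retries and report a failure instead of looping indefinitely.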