[
https://issues.apache.org/jira/browse/SOLR-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Samuel García Martínez updated SOLR-10181:
------------------------------------------
Description:
When several CREATEALIAS requests are run at the same time by the OCP, it can
happen that, even though the API response is OK, the changes made by some of
those requests are lost.
h3. The problem
The problem happens because the CREATEALIAS cmd implementation relies on
_zkStateReader.getAliases()_ to create the map that will be stored in ZK. If
several threads reach that line at the same time, each builds its map from the
same snapshot, so only the last write will be kept and the others will be
overwritten.
The code I'm referencing is [this
piece|https://github.com/apache/lucene-solr/blob/8c1e67e30e071ceed636083532d4598bf6a8791f/solr/core/src/java/org/apache/solr/cloud/CreateAliasCmd.java#L65].
As an example, let's say that the current aliases map has {a:colA, b:colB}. If
two CREATEALIAS requests (one adding c:colC and the other adding d:colD) are
submitted to the _tpe_ and reach that line at the same time, the resulting maps
will look like {a:colA, b:colB, c:colC} and {a:colA, b:colB, d:colD}. Only one
of them will end up stored in ZK, resulting in "data loss": the API returns OK
even though one of the requests did not take effect.
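The interleaving in the example above can be reproduced without ZK. The following is only a minimal sketch: the in-memory _stored_ map stands in for the aliases data in ZK, and the class and method names (_LostUpdateDemo_, _buildUpdate_, _simulateRace_) are hypothetical and not taken from _CreateAliasCmd_. The two "threads" are interleaved by hand so the outcome is deterministic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LostUpdateDemo {
    // Hypothetical in-memory stand-in for the aliases map stored in ZK.
    static Map<String, String> stored = new LinkedHashMap<>();

    // Mimics the read half of the command: copy the current map, add one alias.
    static Map<String, String> buildUpdate(String alias, String collection) {
        Map<String, String> copy = new LinkedHashMap<>(stored);
        copy.put(alias, collection);
        return copy;
    }

    // Both requests read the same snapshot before either one writes.
    static Map<String, String> simulateRace() {
        stored = new LinkedHashMap<>();
        stored.put("a", "colA");
        stored.put("b", "colB");

        Map<String, String> first = buildUpdate("c", "colC");  // {a, b, c}
        Map<String, String> second = buildUpdate("d", "colD"); // {a, b, d}

        stored = first;   // unconditional write of the first map
        stored = second;  // unconditional write of the second map, c:colC is gone
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(simulateRace()); // prints {a=colA, b=colB, d=colD}
    }
}
```

Both writes "succeed" from the caller's point of view, which matches the OK responses described above, yet only the last map survives.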
On top of this, another concurrency problem could happen when the command
checks whether the alias has been set using the _checkForAlias_ method. If these
two CREATEALIAS zk writes ran at the same time, the alias check for one of the
threads can time out, since only one of the writes has "survived" and been
"committed" to the _zkStateReader.getAliases()_ map.
h3. How to fix it
I can post a patch for this if someone gives me directions on how it should be
fixed. As I see this, there are two places where the issue can be fixed: in the
processor (OverseerCollectionMessageHandler) in a generic way or inside the
command itself.
h5. The processor fix
The locking mechanism (_OverseerCollectionMessageHandler#lockTask_) should be
the place to fix this inside the processor. I thought that adding the operation
name instead of only "collection" or "name" to the locking key would fix the
issue, but I realized that the problem will happen anyway if the concurrency
happens between different operations modifying the same resource (like
CREATEALIAS and DELETEALIAS do). So, if this should be the path to follow I
don't know what should be used as a locking key.
h5. The command fix
Fixing it at the command level (_CreateAliasCmd_ and _DeleteAliasCmd_) would be
relatively easy using optimistic locking, i.e., passing the aliases.json zk
version to keeper.setData. To do that, the Aliases class should expose the
aliases version so the commands can forward that version with the update and
retry when the write fails.
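The optimistic-locking idea can be sketched as follows. This is not a patch against Solr: _VersionedStore_ is a hypothetical in-memory stand-in for the aliases.json znode, and _Snapshot_, _addAlias_ and _runConcurrently_ are illustrative names. The conditional write mirrors the semantics of ZooKeeper's versioned setData, where a write carrying a stale version is rejected; here a rejection is reported as _false_ instead of an exception to keep the example small.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OptimisticAliasUpdate {
    // Alias map together with the version it was read at (hypothetical type).
    static final class Snapshot {
        final Map<String, String> aliases;
        final int version;
        Snapshot(Map<String, String> aliases, int version) {
            this.aliases = aliases;
            this.version = version;
        }
    }

    // In-memory stand-in for the aliases.json znode.
    static final class VersionedStore {
        private Map<String, String> data = new LinkedHashMap<>();
        private int version = 0;

        synchronized Snapshot read() {
            return new Snapshot(new LinkedHashMap<>(data), version);
        }

        // Applies the write only if the caller saw the latest version.
        synchronized boolean setData(Map<String, String> newData, int expectedVersion) {
            if (expectedVersion != version) {
                return false; // stale version, caller must re-read and retry
            }
            data = new LinkedHashMap<>(newData);
            version++;
            return true;
        }
    }

    // Read, modify, conditional write; retry on a version conflict.
    static int addAlias(VersionedStore store, String alias, String collection, int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            Snapshot snap = store.read();
            Map<String, String> updated = new LinkedHashMap<>(snap.aliases);
            updated.put(alias, collection);
            if (store.setData(updated, snap.version)) {
                return attempt; // number of attempts that were needed
            }
        }
        throw new IllegalStateException("gave up after " + maxRetries + " attempts");
    }

    static Map<String, String> runConcurrently() {
        VersionedStore store = new VersionedStore();
        addAlias(store, "a", "colA", 5);
        addAlias(store, "b", "colB", 5);
        Thread t1 = new Thread(() -> addAlias(store, "c", "colC", 100));
        Thread t2 = new Thread(() -> addAlias(store, "d", "colD", 100));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return store.read().aliases;
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently()); // contains a, b, c and d
    }
}
```

Whichever thread loses the race re-reads the map, which now includes the other thread's alias, and writes again, so both c:colC and d:colD end up stored. The same loop would apply to a delete command, since the version check does not depend on which operation produced the conflicting write.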
was:
When several CREATEALIAS are run at the same time by the OCP it could happen
that, even tho the API response is OK, some of those CREATEALIAS request
changes are lost.
The problem happens because the CREATEALIAS cmd implementation relies on
zkStateReader.getAliases() to create the map that will be stored in ZK. If
several threads reach that line at the same time it will happen that only one
will be stored correctly and the others will be overridden.
The code I'm referencing is [this
piece|https://github.com/apache/lucene-solr/blob/8c1e67e30e071ceed636083532d4598bf6a8791f/solr/core/src/java/org/apache/solr/cloud/CreateAliasCmd.java#L65].
As an example, let's say that the current aliases map has {a:colA, b:colB}. If
two CREATEALIAS (one adding c:colC and other creating d:colD) are scheduled in
the _tpe_ and reach that line at the same time, the resulting maps will look
like {a:colA, b:colB, c:colC} and {a:colA, b:colB, d:colD} and only one of them
will be stored correctly in ZK, resulting in "data loss", meaning that API is
returning OK despite that it didn't work as expected.
On top of this, another concurrency problem could happen when the command
checks the alias being set using _checkForAlias_ method. After the two
CREATEALIAS zk write being run at the same time, when the alias is being check
one of the threads can timeout since only one of them has "survived" and has
been written to the _zkStateReader.getAliases()_ map.
I can post a patch to this if someone gives me directions on how it sould be
fixed. As I see this, there are two places where the issue can be fixed: in the
processor (OverseerCollectionMessageHandler) in a generic way or inside the
command itself.
The processor fix
The locking mechanism (OverseerCollectionMessageHandler#lockTask) should be the
place to fix this inside the processor. I thought that adding the operation
name instead of only "collection" or "name" to the locking key would fix the
issue, but I realized that the problem will happen anyway if the concurrency
happens between different operations modifying the same resource (like
CREATEALIAS and DELETEALIAS do). So, if this should be the path to follow I
don't know what should be used as a locking key.
The command fix
Fixing it at the command level (CreateAliasCmd and DeleteAliasCmd) would be
relatively easy. Using optimistic locking, i.e, using the aliases.json zk
version in the keeper.setData. To do that, Aliases class should offer the
aliases version so the commands can forward that version with the update and
retry when it fails.
> CREATEALIAS and DELETEALIAS commands consistency problems under concurrency
> ---------------------------------------------------------------------------
>
> Key: SOLR-10181
> URL: https://issues.apache.org/jira/browse/SOLR-10181
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Affects Versions: 5.3, 5.4, 5.5, 6.4.1
> Reporter: Samuel García Martínez
>