[jira] [Commented] (SOLR-11739) Solr can accept duplicated async IDs
[ https://issues.apache.org/jira/browse/SOLR-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366452#comment-16366452 ] ASF subversion and git services commented on SOLR-11739: Commit dfb0803bbb972d730627fbdcd4df66558d06f13a in lucene-solr's branch refs/heads/branch_7x from [~tomasflobbe] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dfb0803 ] SOLR-11739: Remove cast no longer needed > Solr can accept duplicated async IDs > > > Key: SOLR-11739 > URL: https://issues.apache.org/jira/browse/SOLR-11739 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Tomás Fernández Löbbe >Assignee: Tomás Fernández Löbbe >Priority: Minor > Attachments: SOLR-11739.patch, SOLR-11739.patch, SOLR-11739.patch > > > Solr is supposed to reject duplicated async IDs, however, if the repeated IDs > are sent fast enough, a race condition in Solr will let the repeated IDs > through. The duplicated task is ran and and then silently fails to report as > completed because the same async ID is already in the completed map. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11739) Solr can accept duplicated async IDs
[ https://issues.apache.org/jira/browse/SOLR-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366451#comment-16366451 ] ASF subversion and git services commented on SOLR-11739: Commit f6b6f5070270c93fa7d8604ed456c9df041e7454 in lucene-solr's branch refs/heads/branch_7x from [~tomasflobbe] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f6b6f50 ] SOLR-11739: Don't accept duplicate async IDs in collection API operations > Solr can accept duplicated async IDs > > > Key: SOLR-11739 > URL: https://issues.apache.org/jira/browse/SOLR-11739 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Tomás Fernández Löbbe >Assignee: Tomás Fernández Löbbe >Priority: Minor > Attachments: SOLR-11739.patch, SOLR-11739.patch, SOLR-11739.patch > > > Solr is supposed to reject duplicated async IDs, however, if the repeated IDs > are sent fast enough, a race condition in Solr will let the repeated IDs > through. The duplicated task is ran and and then silently fails to report as > completed because the same async ID is already in the completed map. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11739) Solr can accept duplicated async IDs
[ https://issues.apache.org/jira/browse/SOLR-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366450#comment-16366450 ] ASF subversion and git services commented on SOLR-11739: Commit 250e5b2aba2c25540f597f4436c619fd97febd0d in lucene-solr's branch refs/heads/master from [~tomasflobbe] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=250e5b2 ] SOLR-11739: Remove cast no longer needed > Solr can accept duplicated async IDs > > > Key: SOLR-11739 > URL: https://issues.apache.org/jira/browse/SOLR-11739 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Tomás Fernández Löbbe >Assignee: Tomás Fernández Löbbe >Priority: Minor > Attachments: SOLR-11739.patch, SOLR-11739.patch, SOLR-11739.patch > > > Solr is supposed to reject duplicated async IDs, however, if the repeated IDs > are sent fast enough, a race condition in Solr will let the repeated IDs > through. The duplicated task is ran and and then silently fails to report as > completed because the same async ID is already in the completed map. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11739) Solr can accept duplicated async IDs
[ https://issues.apache.org/jira/browse/SOLR-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366446#comment-16366446 ] ASF subversion and git services commented on SOLR-11739: Commit 61ea8f60b1c69ba9ed753fe533d571fcbb452887 in lucene-solr's branch refs/heads/master from [~tomasflobbe] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=61ea8f6 ] SOLR-11739: Don't accept duplicate async IDs in collection API operations > Solr can accept duplicated async IDs > > > Key: SOLR-11739 > URL: https://issues.apache.org/jira/browse/SOLR-11739 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Tomás Fernández Löbbe >Assignee: Tomás Fernández Löbbe >Priority: Minor > Attachments: SOLR-11739.patch, SOLR-11739.patch, SOLR-11739.patch > > > Solr is supposed to reject duplicated async IDs, however, if the repeated IDs > are sent fast enough, a race condition in Solr will let the repeated IDs > through. The duplicated task is ran and and then silently fails to report as > completed because the same async ID is already in the completed map. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11739) Solr can accept duplicated async IDs
[ https://issues.apache.org/jira/browse/SOLR-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351606#comment-16351606 ] Tomás Fernández Löbbe commented on SOLR-11739: -- Thanks for the thorough review Anshum. {quote}We should add back the check in old places instead of leaving that as a TODO {quote} Yes, that was my plan. I'll upload a patch now with the back compat support. {quote}L296: It would be good to log the reason for the failure to offer the request to the OverseerCollectionQueue {quote} Note that the exception will be thrown in this case {quote}Are the log4j.properties changes intentional ? {quote} For my testing, I'll revert before committing {quote}SizeLimitedDistributedMap.OnChildDeleteObserver interface doesn’t need the keyword static also can you add some javadoc there? {quote} Good catch. Changed. {quote}As I understand, you would also want to override clear() and call onDeleteObserver.onChildDelete in those cases too. {quote} Really, I was thinking on the observer to be called only in the case of overflow, not for a regular delete (it's also not called on remove). I modified the name to {{OnOverflowObserver}} of the observer to be clearer about this. {quote}Javadocs for claim/clearAsyncId. It’d be good to mention that it returns false when it tries to clear an ID that doesn’t exist any more. {quote} Added Also added some more tests to the new code and to {{DistributedMap}} since I couldn't find any. {quote}...How about when the Overseer is elected, it establishes a source of entropy (Random initialized from time) and uses that to issue UUID's... {quote} That is another option. Although note that there is a client option to call with an auto-generated asyncId, which should prevent collisions. We could also switch to a model in which we don't accept client-provided async IDs, but I guess that should be another Jira, there would be much more changes for that, including an API change > Solr can accept duplicated async IDs > > > Key: SOLR-11739 > URL: https://issues.apache.org/jira/browse/SOLR-11739 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Tomás Fernández Löbbe >Assignee: Tomás Fernández Löbbe >Priority: Minor > Attachments: SOLR-11739.patch, SOLR-11739.patch, SOLR-11739.patch > > > Solr is supposed to reject duplicated async IDs, however, if the repeated IDs > are sent fast enough, a race condition in Solr will let the repeated IDs > through. The duplicated task is ran and and then silently fails to report as > completed because the same async ID is already in the completed map. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11739) Solr can accept duplicated async IDs
[ https://issues.apache.org/jira/browse/SOLR-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348004#comment-16348004 ] Gus Heck commented on SOLR-11739: - [~hossman], To me it seems that with your 3 command example things are a little worse than you say... There would be a window where id 222 could be seen as the success of CREATESOMETHING, and someone checking on DOWHATEVER might think DOWHATEVER had been done successfully (yay, go home, throw a party... ) but then DOWHATEVER fails (Monday's gonna be less fun...), and then some automated process checks on 222 to verify that it did actually CREATESOMETHING, but sees a failure... (drat, do it again... and again and again.. continually failing because SOMETHING now exists). Sure, it's their fault for not coordinating their ID's but... why help them make that mistake? I think any ID that is not unique is more or less useless. I haven't used async requests and haven't previously paid much attention to it, don't know the history, and I might be missing something, but I find it shocking that Solr is not generating the id and ensuring it's uniqueness. How about when the Overseer is elected, it establishes a source of entropy (Random initialized from time) and uses that to issue UUID's. There's only one overseer at a time, and the cases where 2 or more overseers are started at exactly the same time and coexist is a bug, right? If there's no overseer, commands can't be run anyway... > Solr can accept duplicated async IDs > > > Key: SOLR-11739 > URL: https://issues.apache.org/jira/browse/SOLR-11739 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Tomás Fernández Löbbe >Assignee: Tomás Fernández Löbbe >Priority: Minor > Attachments: SOLR-11739.patch, SOLR-11739.patch > > > Solr is supposed to reject duplicated async IDs, however, if the repeated IDs > are sent fast enough, a race condition in Solr will let the repeated IDs > through. The duplicated task is ran and and then silently fails to report as > completed because the same async ID is already in the completed map. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11739) Solr can accept duplicated async IDs
[ https://issues.apache.org/jira/browse/SOLR-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314192#comment-16314192 ] Anshum Gupta commented on SOLR-11739: - Thanks for taking this up [~tomasflobbe]! This looks good and not super invasive. Here's some feedback: * CollectionsHandler.java ** We should add back the check in old places instead of leaving that as a TODO ** L296: It would be good to log the reason for the failure to offer the request to the OverseerCollectionQueue * Are the log4j.properties changes intentional ? * SizeLimitedDistributedMap.OnChildDeleteObserver interface doesn’t need the keyword static also can you add some javadoc there? * As I understand, you would also want to override clear() and call onDeleteObserver.onChildDelete in those cases too. * Javadocs for claim/clearAsyncId. It’d be good to mention that it returns false when it tries to clear an ID that doesn’t exist any more. > Solr can accept duplicated async IDs > > > Key: SOLR-11739 > URL: https://issues.apache.org/jira/browse/SOLR-11739 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tomás Fernández Löbbe >Assignee: Tomás Fernández Löbbe >Priority: Minor > Attachments: SOLR-11739.patch, SOLR-11739.patch > > > Solr is supposed to reject duplicated async IDs, however, if the repeated IDs > are sent fast enough, a race condition in Solr will let the repeated IDs > through. The duplicated task is ran and and then silently fails to report as > completed because the same async ID is already in the completed map. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11739) Solr can accept duplicated async IDs
[ https://issues.apache.org/jira/browse/SOLR-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298901#comment-16298901 ] Tomás Fernández Löbbe commented on SOLR-11739: -- bq. feel like my suggestion was orthognal to to the concern you suggested in response. No, I meant "Solr won't re-execute the same request twice" *even if it receives it twice*. What I say is that, by Solr rejecting the duplicate async ID it makes the admin request idempotent, you could do something like: {code} while (!success) { try { performRequest(asyncId=1=CREATESOMETHING) success = true } catch (e) { //backoff } } {code} If there is some error back from {{performRequest}}, you don’t know if the request was scheduled or not, but you don’t care if you know Solr won’t re-schedule it, you can just send it again. Another use case could be workers watching a queue and executing actions, if the queue delivers a message more than once, you don’t have to worry about sending the command to Solr multiple times, and can assume a “request already exists” response from Solr to be a success. > Solr can accept duplicated async IDs > > > Key: SOLR-11739 > URL: https://issues.apache.org/jira/browse/SOLR-11739 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tomás Fernández Löbbe >Assignee: Tomás Fernández Löbbe >Priority: Minor > Attachments: SOLR-11739.patch, SOLR-11739.patch > > > Solr is supposed to reject duplicated async IDs, however, if the repeated IDs > are sent fast enough, a race condition in Solr will let the repeated IDs > through. The duplicated task is ran and and then silently fails to report as > completed because the same async ID is already in the completed map. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11739) Solr can accept duplicated async IDs
[ https://issues.apache.org/jira/browse/SOLR-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297890#comment-16297890 ] Hoss Man commented on SOLR-11739: - i have no strong opinions ... at a high level i think the approach you describe (i didn't review the patch) is fine from a stability/reliability and back compat standpoint, but I have no idea how it affects performance (although it should go w/o saying that i favor stability/reliability over performance when discussing "admin" actions) To clarify one comment where i suspect we may have been misscommunicating... {quote} bq. Solr should happily let you specify the same asyncId multiple times hmm I'm not sure. I think the fact that Solr won't re-execute the same request twice makes it much easier to write workflows that do collection operations. {quote} I feel like my suggestion was orthoginal to to the concern you suggested in response. What i was suggesting is that, in theory: Solr could be agnostic to the asyncId and *only* use them for tracking results. Example: if these 4 commands come in... * asyncId=222=CREATESOMETHING * asyncId=1=DOSOMETHINGELSE * asyncId=222=DOWHATEVER ...then Solr could in fact execute those 3 commands, in that sequence, reliably, and w/o any risk of any of them being executed more or less then exactly 1 time. The fact that 2 of them have the same asyncId should in no way impact Solr's execution of those commands. The only impact that the duplicated asyncId should have is that once the DOWHATEVER cmd is executed, it will no longer be possible for a client to ever check the status of the CREATESOMETHING, because the results of the DOWHATEVER command should overwrite the results of the CREATESOMETHING ... that could be solr's *sole* internal use of the asyncId (ie: only track all commands to process in an ordered queue, use the asyncId only in a map we don't care about internally, other then if a user asks for status info) that was the crux of my suggestion ... assuming we want a policy of solr being "agnostic" to duplicate asyncIds in order to reduce the zk overhead of tracking/rejecting dups. (BUt i don't feel strongly about it) > Solr can accept duplicated async IDs > > > Key: SOLR-11739 > URL: https://issues.apache.org/jira/browse/SOLR-11739 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tomás Fernández Löbbe >Assignee: Tomás Fernández Löbbe >Priority: Minor > Attachments: SOLR-11739.patch, SOLR-11739.patch > > > Solr is supposed to reject duplicated async IDs, however, if the repeated IDs > are sent fast enough, a race condition in Solr will let the repeated IDs > through. The duplicated task is ran and and then silently fails to report as > completed because the same async ID is already in the completed map. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11739) Solr can accept duplicated async IDs
[ https://issues.apache.org/jira/browse/SOLR-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286410#comment-16286410 ] Tomás Fernández Löbbe commented on SOLR-11739: -- bq. Solr should assign the asyncIds and guarantee that they are unique We now do something like this when async IDs are not provided, but *on the client side*. See https://github.com/apache/lucene-solr/blob/41644bdcdcc0734115ce08ec24d6b408e1f8cf28/solr/solrj/src/java/org/apache/solr/client/solrj/request/CollectionAdminRequest.java#L150-L152 bq. It sounds like tomas is saying that there is existing code in solr that tries to reject duplicate asyncIds Yes, here: https://github.com/apache/lucene-solr/blob/41644bdcdcc0734115ce08ec24d6b408e1f8cf28/solr/core/src/java/org/apache/solr/handler/admin/CollectionsHandler.java#L282-L286 bq. Solr should happily let you specify the same asyncId multiple times hmm I'm not sure. I think the fact that Solr won't re-execute the same request twice makes it much easier to write workflows that do collection operations. bq. isn't the amount of work / zk writes needed to generate a universally unique asyncId on the server side essentially the same as the amount needed to tell the client that the asyncId they specified isn't unique? We now do a bunch of reads to check if the async IDs are in any of the current queues/maps (see the code I linked above). I was considering either writing to some ZooKeeper path the async ID as part of a lock, before checking the existing queues and then deleting the lock, or moving completely the async IDs to it's own path, and starting using this as the source of truth of asyncIDs. The later would then make it easier to check if an async ID is currently in use, however it's a much bigger change and we'll need to consider back compatibility > Solr can accept duplicated async IDs > > > Key: SOLR-11739 > URL: https://issues.apache.org/jira/browse/SOLR-11739 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tomás Fernández Löbbe >Assignee: Tomás Fernández Löbbe >Priority: Minor > Attachments: SOLR-11739.patch > > > Solr is supposed to reject duplicated async IDs, however, if the repeated IDs > are sent fast enough, a race condition in Solr will let the repeated IDs > through. The duplicated task is ran and and then silently fails to report as > completed because the same async ID is already in the completed map. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11739) Solr can accept duplicated async IDs
[ https://issues.apache.org/jira/browse/SOLR-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286346#comment-16286346 ] Hoss Man commented on SOLR-11739: - My off the cuff, uneducated, impression w/o knowing much about the internals or the history of how the existing code go to the state it's currently in is that... Either: # Solr should assign the asyncIds and guarantee that they are unique # The user should assign the asyncIds, and Solr should make no assumptions about them, nor use them for *anything* other then reporting status. #1 seems a lot harder since it's essentially a distributed unique key generation problem, which IIUC is why asyncId wasn't implemented that way in the first place. For #2, from my perspective, It sounds like tomas is saying that there is existing code in solr that tries to reject duplicate asyncIds -- and I would argue (as a straw man) that Solr making any attempt at doing that is where the real bug lies ... Solr should happily let you specify the same asyncId multiple times, and that should have no affect at all on the reliability of the commands being executed in the order recieved. the only thing it should affect is that requesting status info on the commands may give unexpected results (depending on what the client is expecting) ... i would expect that requesting status for the id would return the status of the "1st" instance, until the "2nd" instance finishes at which point the status info is overridden. that way if a user wants to re-use the exact same asyncId for every request, they are welcome to put that bullet in their foot as many times as they want -- it keeps things simpler for us internally, and we're not trying to coddle them for doing something (very advanced) in a silly way. If we're going to coddle them, then we should coddle them all the way -- isn't the amount of work / zk writes needed to generate a universally unique asyncId on the server side essentially the same as the amount needed to tell the client that the asyncId they specified isn't unique? > Solr can accept duplicated async IDs > > > Key: SOLR-11739 > URL: https://issues.apache.org/jira/browse/SOLR-11739 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tomás Fernández Löbbe >Assignee: Tomás Fernández Löbbe >Priority: Minor > Attachments: SOLR-11739.patch > > > Solr is supposed to reject duplicated async IDs, however, if the repeated IDs > are sent fast enough, a race condition in Solr will let the repeated IDs > through. The duplicated task is ran and and then silently fails to report as > completed because the same async ID is already in the completed map. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11739) Solr can accept duplicated async IDs
[ https://issues.apache.org/jira/browse/SOLR-11739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284147#comment-16284147 ] Tomás Fernández Löbbe commented on SOLR-11739: -- I thought about three options 1. Fix the actual race condition, don't let duplicate async IDs at all. 2. Fix the Overseer so that it checks before running each task if one with the same ID was completed before. 3. Let the Overseer re-run the tasks (leave it as it is now). Maybe just add logging, or a way to show the error (failed tasks) #3 can be dangerous, since the task could be something like a DELETEREPLICA. If the duplicate ID was caused by some broken retry logic on the client side, Solr could be deleting many replicas with what the client thought was a single command. #2 may be OK, the problem I see with that is that it gives an inconsistent behavior to the user (sometimes the duplicate IDs are rejected, and sometimes not). Also, this would make the Overseer silently drop tasks (yes, we can add some sort of failure in the logs but we can’t assume anyone is going to notice). #1 is the correct fix from the functional stand point, however I can’t think of a way to really fix the race condition without adding an extra write to ZooKeeper, which we’d have to do for every collection request with an asyncID. And this is to cover from a client misuse edge case. I think (and I discussed this offline with [~anshumg], he thinks this too) #1 is the way to go. I’ll put up a patch. > Solr can accept duplicated async IDs > > > Key: SOLR-11739 > URL: https://issues.apache.org/jira/browse/SOLR-11739 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Tomás Fernández Löbbe >Priority: Minor > Attachments: SOLR-11739.patch > > > Solr is supposed to reject duplicated async IDs, however, if the repeated IDs > are sent fast enough, a race condition in Solr will let the repeated IDs > through. The duplicated task is ran and and then silently fails to report as > completed because the same async ID is already in the completed map. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org