[jira] [Updated] (SOLR-445) Update Handlers abort with bad documents

Hoss Man (JIRA) Fri, 05 Feb 2016 15:43:33 -0800

     [ 
https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hoss Man updated SOLR-445:
--------------------------
    Attachment: SOLR-445.patch

Ok, i've been going around in circles on this issue for the past few weeks, but 
i've finally got something that feels like progress.

Listing everything new since the last patch would be fairely tedioius, so i'll 
focus on the broad strokes...

* punting on deletes (for now)
** the previuos processDelete method was totally broken in distributed queries 
(was doing equality comparisons of Strings directly to enum objects, made 
assumptions about {{isLeader}} pased on previous "add" commands in the same 
request, etc...)
** i was having enough time struggling with getting adds to work properly, that 
i just removed the delete code completely
** based on the rest of the progress (see below) re-adding support for deletes 
should be straight forward, but will require some refactoring of how the errors 
are tracked & counted to distinguish between an "add doc w/id=22" that fails 
and a "delete doc w/id=22" that fails in the same request
* Dealing with forward(ing/ed) requests and async distributed failures
** Most of the meat of the test failures in the last patch came from dealing 
with the async requests fired off my 
{{DistributedUpdateProcessor}} and how to deal with failures reported by other 
leaders.
** I started down the road of trying to do a bettter job in 
{{SolrCmdDistributor}} of tracking the {{UpdateCommand}} that corisponds with 
each {{Req}} so that when processing the {{Error}} we could know what failed 
remotely -- this code is still in the patch (because i think it's cleaner then 
the {{cmdString}} tracking currently in Solr) but proved mostly useless because 
of how {{ConcurrentUpdateSolrClient}} can combine multiple "requests" together.
** The next step, which mostly worked, was to improve the error handling in 
DUP's {{finish()}} so that instead of aborting with info about whatever 
(remote) Error happened to return first, it now returns a new 
{{DistributedUpdatesAsyncException}} which wraps & remembers the summary info 
of all the remote errors encountered -- or at least, all of the errors that 
were previously candidates to tell the user about.  Stuff that was swallowed & 
logged before is still swallowed & logged.
*** One notable change her is that i switched {{DUP.finish()}} from directly 
calling {{SOlrQueryResponse.setException()}} and instead made it throw the 
exception.  Independent of this issue, the existing behavior seems like a bug / 
bad-form -- what if the caller already caught some earlier exception it wants 
to return and {{finish()}} is just being called in finally?
*** This lead me to discover SOLR-8633 -- the patch from that issue is 
currently including in this patch because it's so neccessary for hte current 
code.
*** FWIW: if, for some reason, folks think calling 
{{SOlrQueryResponse.setException()}} is better for some wacko reason, then 
{{TolerantUpdateProcess.finish()}} could, in theory, go check 
{{SOlrQueryResponse.getException()}} and treat it the same way as exceptions it 
catches (see below) but that's a lot more tedious and (and error prone in the 
long run)
** Once {{DUP.finish()}} started throwing {{DistributedUpdatesAsyncException}}, 
tracking all the various errors we care about from distributed requests became 
possible...
*** {{TolerantUpdateProcessor}} now pays attention to when the request is a 
{{TOLEADER}} forwarded request, and if so:
**** it acts tolerant of up to {{maxErrors}} failures (just like single node)
**** in {{finish()}}, if *any* failures happened, return the first error -- and 
annotate it with info about *all* the failures in this request, using the 
existing {{SolrException.getMetadata()}} map.
*** {{ConcurrentUpdateSolrClient}} already ensures that if a {{SolrException}} 
happens on the remote server, any "metadata" included in the response for that 
exception is copied into the local {{SolrException}} it constructs
*** So if/when {{TolerantUpdateProcessor.finish()}} catches a 
{{DistributedUpdatesAsyncException}}, it loops over the wrapped exceptions, and 
pulls out the metadata for each of those to update it's list of known failures 
* which error is returned if {{maxErrors}} exceeded: now the "first" one
** i mentioned this above in discussing how async errors from other leaders are 
handled, but i wanted to emphasis it and elaborate a bit
** in the original patch, exceptions were *completley* ignored until 
{{maxErrors}} were seen, at which point hte next exception was immediately 
(re)-thrown
** that made sense for single node cases, but in a distributed case, with async 
exceptions, we can't always just "rethrow" an exception
*** it's also ambiguious what the "last" exception really is in an async 
situation.
** so now, instead,  {{TolerantUpdateProcessor}} always remembers the *first* 
exception it caught, and if/when {{maxErrors}} is exceeded, it re-throws that 
first exception
*** later, in {{finish()}} that (remembered) exception is annotated with 
metadata about _all_ the failures
**** this metadata is critical for forwarded requests, but should also be 
useful for {{SolrClient}} users who (should) see it copied into a 
{{RemoteSolrException}} even if they don't get the normal {{UpdateResponse}} 
object and it's {{getResponseHeader()}}
** this, in my opinion, makes the behavior of using {{TolerantUpdateProcessor}} 
more useful (and consistent with how stock solr works) when there is a 
fundamental problem with your data -- you get an error indicating the _first_ 
problem document in your data, as opposed to seeing an error from the 11th, or 
101th, or {{maxErrors+1}} malformed document in your data.
** if folks disagree, we just need to re-work the {{FirstErrTracker}} class to 
be a {{LastErrTracker}} class...
*** {{FirstErrTracker.throwFirst()}} becomes {{LastErrTracker.throwLast()}}
*** instead of checking {{null == first}}, 
{{LastErrorTracker.caught(Throwable)}} would ignore any additional exceptions 
once {{true == thrown}}
* the problem with returning {{numAdds}} (and {{numDeletes}} if we want that)
** getting {{numAdds}} to work correctly in a distributed request is really hard
** currently the code (specifically on the first node processing the 
{{AddUpdateCommand}} just requires that {{super.processAdd()}} succeeds to do 
{{numAdds++}}
*** this gives a missleading number when the failures aren't reported 
immediately because they were forwarded async to a diff leader and we only find 
out about problems in {{finish()}}
** even if we only did {{numAdds++}} for docs where we know we are the leader, 
getting the count from _other_ leaders later is tricky
*** right now, the only way {{TolerantUpdateProcessor}} learns the results of 
any async requests to other leaders is if {{DUP.finish()}} throws an error -- 
we can get {{numAdds}} from the metadata of those errors, but that doesn't help 
the case when requests succeed
*** {{DUP}} doesn't currently do anything with successful responses, so there's 
no easy way to get the numAdds in that case
** possible solutions...
*** eliminate the concept of {{numAdds}} from this feature, focus solely on 
being tolerant and reporting the errors (which we can do accurately)
**** if people want to know how many succeeded, they can use features like 
{{versions=true}} explicitly -- although even then, i'm not sure if it will 
work reliably today ... i don't see any indication that DUP merges the remote 
responses when that feature is used.
*** make {{TolerantUpdateProcessor}} detect when it's not a leader for 
something, and in that case *implicitly* add {{version=true}} to get the info 
from requests we forward to, and prune the response down to just a simple 
{{numAdds}} count in finish()
**** same problem as above if i'm correct about {{versions=true}} not currently 
handled correctly in forward(ing/ed) requests
*** return distinct {{numAddsAttempted}} and {{numAddsConfirmed}} values -- 
probably not as useful to end clients, but more accurate representation of what 
we know...
**** track a distinct "numAddsAttempted" for _every_ shard we forward to (added 
together at end of request)
***** kind of a sketchy pain in the ass to do
**** assume {{numAddsConfirmed = numAddsAttempted}} for any other shard leader 
we dont see an exception from
**** for shard leaders we do get exceptions from, add to our numAddsConfirmed 
based on the metadata (ie: {{numAddsConfirmed + 
remoteException.getMetatata(NUM_ADDS_CONFIRMED)}}
*** treat it as a distinct feature in a new jira
**** geting {{numAdds}} (or {{numDeletes}}) count in the responseHeader really 
feels like it should be an orthoginal feature to being tolerant of {{maxErrors}}
**** we should open a distinct issue to track adding that as a feature, and 
ensuring that DUP aggregates correctly from all the various leaders that 
requests get forwarded to
**** this is waht i personally think we should do -- notably because it would 
make it trivial to know how many documents you added when streaming a bunch of 
data, even if you *don't* use this update processor.
* the nuance of the {{maxErrors=N}} param
** i just want to point out -- in a distributed cloud setup, there is no way to 
truely enforce a hard limit at {{maxErrors}}
** the async processing means that if a request includes docs destined for diff 
shards, we can't ensure that *only* that max amount of errors will be hit -- we 
might hit more errors in async threads before we notice and stop processing
** i don't think this is a problem, it's a feature of handling updates to diff 
shards/leaders in paralel, but it does mean we might want to rethink the param 
name ? ({{errorTolleranceThreshold=N}} perhaps?


There's still a lot of work todo, in no particular order...

* deletes
** need to refactor the way we track errors (both locally and stashed in the 
SolrException metadata) so that we can we can distinguish "add doc w/id=22" 
faiulres from "delete doc w/id=22" failures
** probably need to rethink the {{responseHeader}} formatting for how errors 
are tracked as well in order to distinguish them
** once we have that, adding a {{processDeletes(...)}} method that works 
similar to processAdd should be trivial
** need to beef up the cloud testing to include delete checks
* {{CloudSolrClient}}
** all of the tests using {{CloudSolrClient}} currently fail because of how 
that client does it's own "direct to leaders" splitting/merging of docs 
destined for diff shards.
** a bunch of new code needs written there (may be able to refactor/share some 
stuff from the error list merging code in {{TolerantUpdateProcessor}} (see 
above about refactoring that for deletes)
* need lots more randomized testing
* tons of nocommits that need cleaned up


> Update Handlers abort with bad documents
> ----------------------------------------
>
>                 Key: SOLR-445
>                 URL: https://issues.apache.org/jira/browse/SOLR-445
>             Project: Solr
>          Issue Type: Improvement
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Will Johnson
>            Assignee: Anshum Gupta
>         Attachments: SOLR-445-3_x.patch, SOLR-445-alternative.patch, 
> SOLR-445-alternative.patch, SOLR-445-alternative.patch, 
> SOLR-445-alternative.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, 
> SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, 
> SOLR-445.patch, SOLR-445.patch, SOLR-445_3x.patch, solr-445.xml
>
>
> Has anyone run into the problem of handling bad documents / failures mid 
> batch.  Ie:
> <add>
>   <doc>
>     <field name="id">1</field>
>   </doc>
>   <doc>
>     <field name="id">2</field>
>     <field name="myDateField">I_AM_A_BAD_DATE</field>
>   </doc>
>   <doc>
>     <field name="id">3</field>
>   </doc>
> </add>
> Right now solr adds the first doc and then aborts.  It would seem like it 
> should either fail the entire batch or log a message/return a code and then 
> continue on to add doc 3.  Option 1 would seem to be much harder to 
> accomplish and possibly require more memory while Option 2 would require more 
> information to come back from the API.  I'm about to dig into this but I 
> thought I'd ask to see if anyone had any suggestions, thoughts or comments.   
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SOLR-445) Update Handlers abort with bad documents

Reply via email to