[
https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated SOLR-445:
--------------------------
Attachment: SOLR-445.patch
Ok, i've been going around in circles on this issue for the past few weeks, but
i've finally got something that feels like progress.
Listing everything new since the last patch would be fairely tedioius, so i'll
focus on the broad strokes...
* punting on deletes (for now)
** the previuos processDelete method was totally broken in distributed queries
(was doing equality comparisons of Strings directly to enum objects, made
assumptions about {{isLeader}} pased on previous "add" commands in the same
request, etc...)
** i was having enough time struggling with getting adds to work properly, that
i just removed the delete code completely
** based on the rest of the progress (see below) re-adding support for deletes
should be straight forward, but will require some refactoring of how the errors
are tracked & counted to distinguish between an "add doc w/id=22" that fails
and a "delete doc w/id=22" that fails in the same request
* Dealing with forward(ing/ed) requests and async distributed failures
** Most of the meat of the test failures in the last patch came from dealing
with the async requests fired off my
{{DistributedUpdateProcessor}} and how to deal with failures reported by other
leaders.
** I started down the road of trying to do a bettter job in
{{SolrCmdDistributor}} of tracking the {{UpdateCommand}} that corisponds with
each {{Req}} so that when processing the {{Error}} we could know what failed
remotely -- this code is still in the patch (because i think it's cleaner then
the {{cmdString}} tracking currently in Solr) but proved mostly useless because
of how {{ConcurrentUpdateSolrClient}} can combine multiple "requests" together.
** The next step, which mostly worked, was to improve the error handling in
DUP's {{finish()}} so that instead of aborting with info about whatever
(remote) Error happened to return first, it now returns a new
{{DistributedUpdatesAsyncException}} which wraps & remembers the summary info
of all the remote errors encountered -- or at least, all of the errors that
were previously candidates to tell the user about. Stuff that was swallowed &
logged before is still swallowed & logged.
*** One notable change her is that i switched {{DUP.finish()}} from directly
calling {{SOlrQueryResponse.setException()}} and instead made it throw the
exception. Independent of this issue, the existing behavior seems like a bug /
bad-form -- what if the caller already caught some earlier exception it wants
to return and {{finish()}} is just being called in finally?
*** This lead me to discover SOLR-8633 -- the patch from that issue is
currently including in this patch because it's so neccessary for hte current
code.
*** FWIW: if, for some reason, folks think calling
{{SOlrQueryResponse.setException()}} is better for some wacko reason, then
{{TolerantUpdateProcess.finish()}} could, in theory, go check
{{SOlrQueryResponse.getException()}} and treat it the same way as exceptions it
catches (see below) but that's a lot more tedious and (and error prone in the
long run)
** Once {{DUP.finish()}} started throwing {{DistributedUpdatesAsyncException}},
tracking all the various errors we care about from distributed requests became
possible...
*** {{TolerantUpdateProcessor}} now pays attention to when the request is a
{{TOLEADER}} forwarded request, and if so:
**** it acts tolerant of up to {{maxErrors}} failures (just like single node)
**** in {{finish()}}, if *any* failures happened, return the first error -- and
annotate it with info about *all* the failures in this request, using the
existing {{SolrException.getMetadata()}} map.
*** {{ConcurrentUpdateSolrClient}} already ensures that if a {{SolrException}}
happens on the remote server, any "metadata" included in the response for that
exception is copied into the local {{SolrException}} it constructs
*** So if/when {{TolerantUpdateProcessor.finish()}} catches a
{{DistributedUpdatesAsyncException}}, it loops over the wrapped exceptions, and
pulls out the metadata for each of those to update it's list of known failures
* which error is returned if {{maxErrors}} exceeded: now the "first" one
** i mentioned this above in discussing how async errors from other leaders are
handled, but i wanted to emphasis it and elaborate a bit
** in the original patch, exceptions were *completley* ignored until
{{maxErrors}} were seen, at which point hte next exception was immediately
(re)-thrown
** that made sense for single node cases, but in a distributed case, with async
exceptions, we can't always just "rethrow" an exception
*** it's also ambiguious what the "last" exception really is in an async
situation.
** so now, instead, {{TolerantUpdateProcessor}} always remembers the *first*
exception it caught, and if/when {{maxErrors}} is exceeded, it re-throws that
first exception
*** later, in {{finish()}} that (remembered) exception is annotated with
metadata about _all_ the failures
**** this metadata is critical for forwarded requests, but should also be
useful for {{SolrClient}} users who (should) see it copied into a
{{RemoteSolrException}} even if they don't get the normal {{UpdateResponse}}
object and it's {{getResponseHeader()}}
** this, in my opinion, makes the behavior of using {{TolerantUpdateProcessor}}
more useful (and consistent with how stock solr works) when there is a
fundamental problem with your data -- you get an error indicating the _first_
problem document in your data, as opposed to seeing an error from the 11th, or
101th, or {{maxErrors+1}} malformed document in your data.
** if folks disagree, we just need to re-work the {{FirstErrTracker}} class to
be a {{LastErrTracker}} class...
*** {{FirstErrTracker.throwFirst()}} becomes {{LastErrTracker.throwLast()}}
*** instead of checking {{null == first}},
{{LastErrorTracker.caught(Throwable)}} would ignore any additional exceptions
once {{true == thrown}}
* the problem with returning {{numAdds}} (and {{numDeletes}} if we want that)
** getting {{numAdds}} to work correctly in a distributed request is really hard
** currently the code (specifically on the first node processing the
{{AddUpdateCommand}} just requires that {{super.processAdd()}} succeeds to do
{{numAdds++}}
*** this gives a missleading number when the failures aren't reported
immediately because they were forwarded async to a diff leader and we only find
out about problems in {{finish()}}
** even if we only did {{numAdds++}} for docs where we know we are the leader,
getting the count from _other_ leaders later is tricky
*** right now, the only way {{TolerantUpdateProcessor}} learns the results of
any async requests to other leaders is if {{DUP.finish()}} throws an error --
we can get {{numAdds}} from the metadata of those errors, but that doesn't help
the case when requests succeed
*** {{DUP}} doesn't currently do anything with successful responses, so there's
no easy way to get the numAdds in that case
** possible solutions...
*** eliminate the concept of {{numAdds}} from this feature, focus solely on
being tolerant and reporting the errors (which we can do accurately)
**** if people want to know how many succeeded, they can use features like
{{versions=true}} explicitly -- although even then, i'm not sure if it will
work reliably today ... i don't see any indication that DUP merges the remote
responses when that feature is used.
*** make {{TolerantUpdateProcessor}} detect when it's not a leader for
something, and in that case *implicitly* add {{version=true}} to get the info
from requests we forward to, and prune the response down to just a simple
{{numAdds}} count in finish()
**** same problem as above if i'm correct about {{versions=true}} not currently
handled correctly in forward(ing/ed) requests
*** return distinct {{numAddsAttempted}} and {{numAddsConfirmed}} values --
probably not as useful to end clients, but more accurate representation of what
we know...
**** track a distinct "numAddsAttempted" for _every_ shard we forward to (added
together at end of request)
***** kind of a sketchy pain in the ass to do
**** assume {{numAddsConfirmed = numAddsAttempted}} for any other shard leader
we dont see an exception from
**** for shard leaders we do get exceptions from, add to our numAddsConfirmed
based on the metadata (ie: {{numAddsConfirmed +
remoteException.getMetatata(NUM_ADDS_CONFIRMED)}}
*** treat it as a distinct feature in a new jira
**** geting {{numAdds}} (or {{numDeletes}}) count in the responseHeader really
feels like it should be an orthoginal feature to being tolerant of {{maxErrors}}
**** we should open a distinct issue to track adding that as a feature, and
ensuring that DUP aggregates correctly from all the various leaders that
requests get forwarded to
**** this is waht i personally think we should do -- notably because it would
make it trivial to know how many documents you added when streaming a bunch of
data, even if you *don't* use this update processor.
* the nuance of the {{maxErrors=N}} param
** i just want to point out -- in a distributed cloud setup, there is no way to
truely enforce a hard limit at {{maxErrors}}
** the async processing means that if a request includes docs destined for diff
shards, we can't ensure that *only* that max amount of errors will be hit -- we
might hit more errors in async threads before we notice and stop processing
** i don't think this is a problem, it's a feature of handling updates to diff
shards/leaders in paralel, but it does mean we might want to rethink the param
name ? ({{errorTolleranceThreshold=N}} perhaps?
There's still a lot of work todo, in no particular order...
* deletes
** need to refactor the way we track errors (both locally and stashed in the
SolrException metadata) so that we can we can distinguish "add doc w/id=22"
faiulres from "delete doc w/id=22" failures
** probably need to rethink the {{responseHeader}} formatting for how errors
are tracked as well in order to distinguish them
** once we have that, adding a {{processDeletes(...)}} method that works
similar to processAdd should be trivial
** need to beef up the cloud testing to include delete checks
* {{CloudSolrClient}}
** all of the tests using {{CloudSolrClient}} currently fail because of how
that client does it's own "direct to leaders" splitting/merging of docs
destined for diff shards.
** a bunch of new code needs written there (may be able to refactor/share some
stuff from the error list merging code in {{TolerantUpdateProcessor}} (see
above about refactoring that for deletes)
* need lots more randomized testing
* tons of nocommits that need cleaned up
> Update Handlers abort with bad documents
> ----------------------------------------
>
> Key: SOLR-445
> URL: https://issues.apache.org/jira/browse/SOLR-445
> Project: Solr
> Issue Type: Improvement
> Components: update
> Affects Versions: 1.3
> Reporter: Will Johnson
> Assignee: Anshum Gupta
> Attachments: SOLR-445-3_x.patch, SOLR-445-alternative.patch,
> SOLR-445-alternative.patch, SOLR-445-alternative.patch,
> SOLR-445-alternative.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch,
> SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch,
> SOLR-445.patch, SOLR-445.patch, SOLR-445_3x.patch, solr-445.xml
>
>
> Has anyone run into the problem of handling bad documents / failures mid
> batch. Ie:
> <add>
> <doc>
> <field name="id">1</field>
> </doc>
> <doc>
> <field name="id">2</field>
> <field name="myDateField">I_AM_A_BAD_DATE</field>
> </doc>
> <doc>
> <field name="id">3</field>
> </doc>
> </add>
> Right now solr adds the first doc and then aborts. It would seem like it
> should either fail the entire batch or log a message/return a code and then
> continue on to add doc 3. Option 1 would seem to be much harder to
> accomplish and possibly require more memory while Option 2 would require more
> information to come back from the API. I'm about to dig into this but I
> thought I'd ask to see if anyone had any suggestions, thoughts or comments.
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]