[ 
https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated SOLR-445:
--------------------------------

    Attachment: solr-445.xml
                SOLR-445.patch

Here's a first cut at an improvement, at least.

The attached XML file contains an <add> packet with a number of documents 
illustrating a number of errors. The XML file can be POSTed to Solr for 
indexing via the post.jar file so you can see the output.

This patch attempts to report back to the user the following for each document 
that failed:
1> the ordinal position in the file where the error occurred (e.g. the first, 
second, etc. <doc> tag).
2> the <uniqueKey> if available.
3> the error.

The general idea is to accrue the errors in a StringBuilder and eventually 
re-throw the error after processing as far as possible.
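
To make the accrue-and-re-throw idea concrete, here is a minimal sketch. It is 
not the code in SOLR-445.patch: the class, method, and exception names are 
made up for illustration, a document is reduced to its uniqueKey string, and 
the Consumer stands in for whatever actually adds a document to the index.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical names throughout; this does not reproduce the attached patch.
class BatchErrorSketch {

    /** Thrown once at the end, carrying every per-document failure. */
    static class BatchFailedException extends RuntimeException {
        BatchFailedException(String message) {
            super(message);
        }
    }

    /**
     * Tries to add every document, accrues failures in a StringBuilder,
     * and re-throws a single exception after the whole batch was attempted.
     * Returns the number of documents added when nothing failed.
     */
    static int addAll(List<String> ids, Consumer<String> addDoc) {
        StringBuilder errors = new StringBuilder();
        int added = 0;
        for (int i = 0; i < ids.size(); i++) {
            String id = ids.get(i);
            try {
                addDoc.accept(id);
                added++;
            } catch (RuntimeException e) {
                if (errors.length() > 0) {
                    errors.append(" | "); // pipe-delimit the per-doc entries
                }
                errors.append("doc #").append(i + 1)              // ordinal position
                      .append(" (id=").append(id == null ? "unknown" : id)
                      .append("): ").append(e.getMessage());      // the error
            }
        }
        if (errors.length() > 0) {
            throw new BatchFailedException(errors.toString());
        }
        return added;
    }
}
```

With the three-document example from the original report, a caller would see 
docs 1 and 3 added and a single exception that names doc #2, its id, and the 
underlying error.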

Issues:
1> the reported format in the log file is kind of hard to read. I 
pipe-delimited the various <doc> tags, but they run together in a Windows DOS 
window. What happens on Unix I'm not quite sure. Suggestions welcome.
2> From the original post, rolling this back will be tricky. Very tricky. The 
autocommit feature makes it indeterminate what's been committed to the index, 
so I don't know how to even approach rolling back everything.
3> The intent here is to give the user a clue where to start when figuring out 
what document(s) failed so they don't have to guess.
4> Tests fail, but I have no clue why. I checked out a new copy of trunk and 
that fails as well, so I don't think that this patch is the cause of the 
errors. But let's not commit this until we can be sure.
5> What do you think about limiting the number of docs that fail before 
quitting? One could imagine some ratio (say 10%) that have to fail before 
quitting (with some safeguards, like don't bother calculating the ratio until 
20 docs have been processed or...). Or an absolute number. Should this be a 
parameter? Or hard-coded? The assumption here is that if 10 (or 100 or..) docs 
fail, there's something pretty fundamentally wrong and it's a waste to keep on. 
I don't have any strong feeling here, I can argue it either way....
6> Sorry, all, but I reflexively hit the reformat keystrokes so the raw patch 
may be hard to read. But I'm pretty well in the camp that you *have* to 
reformat as you go or the code will be held hostage to the last person who 
*didn't* format properly. I'm pretty sure I'm using the right codestyle.xml 
file, but let me know if not.
7> I doubt that this has any bearing on, say, SolrJ indexing. Should that be 
another bug (or is there one already)? Anybody got a clue where I'd look for 
that since I'm in the area anyway?
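
For item 5, the ratio-with-a-safeguard variant could look roughly like the 
sketch below. Everything here is an assumption for discussion: the class name, 
the constructor parameters, and the choice to abort when the ratio is reached 
(rather than exceeded) are not part of the patch, and the 10% / 20-doc figures 
are just the numbers floated above.

```java
// Hypothetical helper; whether the limits are parameters or hard-coded is
// exactly the open question in item 5.
class FailureLimitSketch {
    private final double maxFailureRatio;  // e.g. 0.10 for 10%
    private final int minDocsBeforeRatio;  // e.g. 20; the ratio is ignored before this
    private int processed;
    private int failed;

    FailureLimitSketch(double maxFailureRatio, int minDocsBeforeRatio) {
        this.maxFailureRatio = maxFailureRatio;
        this.minDocsBeforeRatio = minDocsBeforeRatio;
    }

    /** Call once per document after attempting to add it. */
    void record(boolean success) {
        processed++;
        if (!success) {
            failed++;
        }
    }

    /** True once enough docs were seen and the failure ratio reached the limit. */
    boolean shouldAbort() {
        if (processed < minDocsBeforeRatio) {
            return false; // too early for the ratio to mean anything
        }
        return (double) failed / processed >= maxFailureRatio;
    }
}
```

An absolute limit would be the same shape with the ratio check replaced by a 
comparison of the failure count against a fixed number.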

Erick

> XmlUpdateRequestHandler bad documents mid batch aborts rest of batch
> --------------------------------------------------------------------
>
>                 Key: SOLR-445
>                 URL: https://issues.apache.org/jira/browse/SOLR-445
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Will Johnson
>            Assignee: Erick Erickson
>             Fix For: Next
>
>         Attachments: SOLR-445.patch, solr-445.xml
>
>
> Has anyone run into the problem of handling bad documents / failures mid 
> batch?  E.g.:
> <add>
>   <doc>
>     <field name="id">1</field>
>   </doc>
>   <doc>
>     <field name="id">2</field>
>     <field name="myDateField">I_AM_A_BAD_DATE</field>
>   </doc>
>   <doc>
>     <field name="id">3</field>
>   </doc>
> </add>
> Right now Solr adds the first doc and then aborts.  It would seem like it 
> should either fail the entire batch or log a message/return a code and then 
> continue on to add doc 3.  Option 1 would seem to be much harder to 
> accomplish and possibly require more memory while Option 2 would require more 
> information to come back from the API.  I'm about to dig into this but I 
> thought I'd ask to see if anyone had any suggestions, thoughts or comments.   
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

