enhance solr to support per-document results in batch mode
----------------------------------------------------------

                 Key: SOLR-3018
                 URL: https://issues.apache.org/jira/browse/SOLR-3018
             Project: Solr
          Issue Type: Improvement
          Components: clients - java
    Affects Versions: 4.0
         Environment: any
            Reporter: Rob Tulloh


It would be useful to have Solr return per-document results instead of a 
generic SolrException when multiple documents are passed via 
CommonsHttpSolrServer. The API supports adding multiple streams/files to a 
request (see SOLR-3010 for an example usage in Jython), but when an error is 
detected, an exception is thrown to the caller, and the caller must then 
determine which document failed to be processed. This is particularly 
problematic for simple document extraction when using Solr and Tika to 
pre-process documents for indexing. In this case, a batch of documents is 
passed to Solr for processing by Tika. If any of the documents fails to be 
processed, a SolrException is thrown for the whole batch:

{noformat}
Mon Jan  9 18:04:50 2012 Caught SolrException handling documents [13356414, 
23590833, 33917483] (<jclass org.apache.solr.common.SolrException 9>, 
org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: 
TIKA-198: Illegal IOException from 
org.apache.tika.parser.microsoft.TNEFParser@6d893ae8  
org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: 
TIK
{noformat}

Instead of throwing this exception, the API could be configured to return a 
response that contains one result per document, indicating how the server 
handled each document in the batch. A caller could then check the response, 
extract the parsed content for the documents that succeeded, and apply 
special handling to the documents that failed to be parsed.
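
To make the proposal concrete, here is a minimal sketch of what a caller 
might do with a per-document response. Nothing here is an existing Solr or 
SolrJ API: the DocumentResult type, its fields, and split_results are all 
hypothetical names chosen for illustration, and the document ids are made up.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DocumentResult:
    """Hypothetical per-document entry in a batch response (not a Solr type)."""
    doc_id: str
    ok: bool
    content: Optional[str] = None   # parsed text when ok is True
    error: Optional[str] = None     # server-side error message when ok is False


def split_results(
    results: List[DocumentResult],
) -> Tuple[List[DocumentResult], List[DocumentResult]]:
    """Separate successfully parsed documents from failed ones."""
    succeeded = [r for r in results if r.ok]
    failed = [r for r in results if not r.ok]
    return succeeded, failed


if __name__ == "__main__":
    # Illustrative response for a batch of three documents.
    response = [
        DocumentResult("doc-1", ok=True, content="parsed text"),
        DocumentResult("doc-2", ok=False, error="TikaException"),
        DocumentResult("doc-3", ok=True, content="more parsed text"),
    ]
    succeeded, failed = split_results(response)
    for r in failed:
        print("needs special handling:", r.doc_id, r.error)
```

With a response of this shape, the caller would not need to guess which 
document in the batch triggered the error.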

There are reasonable workarounds for this in the current product. First, 
callers can pass one document at a time for processing, so there is no 
ambiguity about which document a result belongs to. Another approach is to 
pass a small batch of documents to Solr/Tika and, if an exception is thrown, 
reprocess the documents in that batch one at a time. If the corpus of 
documents is largely well-behaved, few retries will be needed to isolate 
failures.
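
The second workaround can be sketched roughly as follows. This is only an 
illustration under stated assumptions: submit is a hypothetical callable 
standing in for whatever client call sends documents to Solr/Tika (it is 
assumed to raise an exception when any document in the batch fails), and 
resubmitting a document that already went through is assumed to be harmless.

```python
from typing import Callable, Dict, Optional, Sequence


def process_with_fallback(
    doc_ids: Sequence[str],
    submit: Callable[[Sequence[str]], None],  # hypothetical: raises on failure
    batch_size: int = 3,
) -> Dict[str, Optional[Exception]]:
    """Map each document id to None on success or to the exception it raised."""
    results: Dict[str, Optional[Exception]] = {}
    for start in range(0, len(doc_ids), batch_size):
        batch = list(doc_ids[start:start + batch_size])
        try:
            # Fast path: the whole batch goes through in one request.
            submit(batch)
            for doc_id in batch:
                results[doc_id] = None
        except Exception:
            # The batch failed, but we do not know which document caused it,
            # so resubmit each document on its own to isolate the failure.
            for doc_id in batch:
                try:
                    submit([doc_id])
                    results[doc_id] = None
                except Exception as exc:
                    results[doc_id] = exc
    return results
```

A clean batch costs one request. A batch with a failure costs one request 
plus one per document in it, which is why this works best when failures are 
rare.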



        
