In these situations Solr usually returns a 500 error.  The Solr
Connector used to retry indefinitely when such an error came back, but
I believe this logic has since changed, and it may now abort the job if
the error persists for more than a few hours straight.  The reason is
that the Solr Connector has no way of knowing whether the 500 error is
due to a Tika exception on a single document or to something more
fundamentally wrong with your Solr configuration.
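
To make the retry-then-abort behavior described above concrete, here is
a rough sketch in Python.  It is not the connector's actual code: the
names RetryState and next_action and the three-hour threshold are
assumptions made purely for illustration, since "a few hours" is all
that is known here.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold; the real value used by the connector is not known here.
ABORT_AFTER_SECONDS = 3 * 60 * 60


@dataclass
class RetryState:
    # Timestamp of the first 500 in the current unbroken run of 500s.
    first_failure_at: Optional[float] = None


def next_action(state: RetryState, status: int, now: float) -> str:
    """Return 'proceed', 'retry', or 'abort' for one indexing response."""
    if status != 500:
        # Only 500 handling is modeled; any other status ends the streak.
        state.first_failure_at = None
        return "proceed"
    if state.first_failure_at is None:
        state.first_failure_at = now
    # The status alone cannot tell a single bad document from a broken
    # Solr configuration, so the only signal used is how long 500s persist.
    if now - state.first_failure_at > ABORT_AFTER_SECONDS:
        return "abort"
    return "retry"
```

The point of the sketch is that the decision depends only on how long
the errors have lasted, which is why one unparseable file can end up
stopping the whole job.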

The big problem is that Solr should not be returning a 500 error just
because Tika is unhappy with the document.  I believe there is a Solr
ticket that describes the problem and requests different handling; you
may be able to find it.

Karl


On Mon, Sep 10, 2012 at 9:47 AM, Maciej Liżewski
<[email protected]> wrote:
> Hi,
>
> I have found a situation where Solr throws an exception because it is not
> able to parse a specified file, like this:
> INFO: [collection1] webapp=/solr path=/update/extract params={literal.deny_token_document=LDAPgroup:DEAD_AUTHORITY&literal.id=file://///XXXXX/YYYYmovie.mov&literal.allow_token_document=LDAPgroup:50071&literal.allow_token_document=LDAPgroup:group} {} 0 269
> 2012-09-10 15:34:50 org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp4.MP4Parser@48f9a4c1
>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)
>
> Now, I can live with that; I do not expect it to index everything. But I
> am not sure Manifold should react the way it does: it just stops indexing
> anything more from such a job (in fact it shuts down job execution), when
> it should try to index the other pending files. Now I must run indexing by
> hand and check whether everything is OK; when there is such a problem, I add
> a proper "exclude" filter (which means Manifold does not index this kind of
> file at all, although the problem could be with only this one specific file)
> and run it again. Still, I have to guarantee that it won't fail in the
> future on some other file...
>
> Don't you think that Manifold should try to index everything *even* when
> there are problems with indexing some documents?
>
> I am just not sure if this is a bug or a feature... :)
