In these situations Solr usually returns a 500 error. At one point the Solr connector retried indefinitely when such an error came back, but I believe that logic has since changed, and it may now abort the job if the error persists for more than a few hours straight. The reason is that the Solr connector has no way of knowing whether the 500 error comes from a Tika exception on a single document or from something more fundamental being wrong with your Solr configuration.
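To make the retry-then-abort behavior concrete, here is a rough sketch of that kind of policy. It is not the actual connector code: the names `RetryState`, `next_action`, and `MAX_FAILURE_WINDOW_SECONDS` are hypothetical, and the three-hour cutoff is an assumed stand-in for "a few hours".

```python
from dataclasses import dataclass
from typing import Optional

# Assumed cutoff for illustration only; the connector's real limit is not
# stated in this thread.
MAX_FAILURE_WINDOW_SECONDS = 3 * 60 * 60


@dataclass
class RetryState:
    """Remembers when the current unbroken run of 500 responses began."""
    first_failure_at: Optional[float] = None


def next_action(state: RetryState, status: int, now: float) -> str:
    """Return 'continue', 'retry', or 'abort' for one indexing response.

    Only the 500 case is modeled. The caller sees nothing but the status
    code, so a single bad document and a broken Solr setup look identical.
    """
    if status != 500:
        # Any non-500 response ends the run of failures.
        state.first_failure_at = None
        return "continue"
    if state.first_failure_at is None:
        state.first_failure_at = now
    if now - state.first_failure_at > MAX_FAILURE_WINDOW_SECONDS:
        # 500s have persisted past the window: give up on the job.
        return "abort"
    return "retry"
```

With a policy like this, one document that always triggers a 500 keeps the failure run alive until the window expires, which would match the job shutting down as described in the quoted message.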
The big problem is that Solr should not be returning a 500 error just because Tika is unhappy with the document. I believe there is a Solr ticket that describes the problem and requests different handling; you may be able to find it.

Karl

On Mon, Sep 10, 2012 at 9:47 AM, Maciej Liżewski <[email protected]> wrote:
> Hi,
>
> I have found situation when Solr throws exception that it is not able to
> parse specified file, like this:
>
> INFO: [collection1] webapp=/solr path=/update/extract
> params={literal.deny_token_document=LDAPgroup:DEAD_AUTHORITY&literal.id=file://///XXXXX/YYYYmovie.mov&literal.allow_token_document=LDAPgroup:50071&literal.allow_token_document=LDAPgroup:group}
> {} 0 269
> 2012-09-10 15:34:50 org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException:
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
> org.apache.tika.parser.mp4.MP4Parser@48f9a4c1
> at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)
>
> Now - I can live with that, I do not expect it to index everything, but I
> am not sure if Manifold should react the way it is - it just stops indexing
> anything more from such job (and in fact it shuts down job execution) where
> it should try to index other pending files... Now I must run indexing by
> hand, check if everything is ok, when there is such problem - add proper
> "exclude" filter (which leads to Manifold does not index this kind of files
> at all, but problem could be with only this specific single file), and run
> it again. Still - I have to guarantee that it won't fail in future on some
> other file...
>
> Don't you think that Manifold should try to index everything *even* when
> there are problems with indexing some documents?
>
> I am just not sure if this is bug or feature... :)
