I think the preferred solution at the moment is to use the 
"ignoreTikaException" flag in the update/extract portion of your 
"solrconfig.xml" configuration.

Having used this in anger, I can confirm it does successfully allow document 
ingestion to continue where Tika parse errors have occurred.
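
For illustration, a minimal sketch of where the flag would sit, assuming the 
handler is registered the way the stock example "solrconfig.xml" does it (the 
handler name, class and startup attribute below come from that example rather 
than from your setup, so adjust as needed):

```xml
<!-- Sketch only: merge into your existing /update/extract handler definition -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Skip Tika parse failures instead of failing the whole request -->
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
```

With this set, a document that Tika cannot parse should still be indexed with 
whatever metadata and literal.* fields are available, rather than the request 
failing.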

HTH,

Adrian

-----Original Message-----
From: Maciej Liżewski [mailto:[email protected]] 
Sent: 10 September 2012 14:48
To: [email protected]
Subject: question about error handling during indexing

Hi,

I have found a situation where Solr throws an exception saying that it is not 
able to parse the specified file, like this:
INFO: [collection1] webapp=/solr path=/update/extract 
params={literal.deny_token_document=LDAPgroup:DEAD_AUTHORITY&literal.id=file://///XXXXX/YYYYmovie.mov&literal.allow_token_document=LDAPgroup:50071&literal.allow_token_document=LDAPgroup:group}
{} 0 269
2012-09-10 15:34:50 org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp4.MP4Parser@48f9a4c1
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)


Now - I can live with that, I do not expect it to index everything, but I am 
not sure Manifold should react the way it does - it just stops indexing 
anything more from that job (and in fact it shuts down job execution) when it 
should try to index the other pending files... Now I must run indexing by hand, 
check if everything is ok, and when there is such a problem - add a proper 
"exclude" filter (which means Manifold does not index this kind of file at all, 
even though the problem could be with only this one specific file), and run it 
again. Still - I have no guarantee that it won't fail in future on some other 
file...

Don't you think that Manifold should try to index everything *even* when there 
are problems with indexing some documents?

I am just not sure if this is a bug or a feature... :)
