I think the preferred solution at the moment is to use the "ignoreTikaException" flag in the update/extract portion of your "solrconfig.xml" configuration.
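As a rough sketch of what that could look like (assuming your handler is registered the way the stock example config does it; the handler name, class attribute and any other defaults in your own file may differ):

```xml
<!-- Sketch only: assumes the extract handler is declared as in the Solr example solrconfig.xml -->
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Skip documents that Tika fails to parse instead of failing the request -->
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
```

Since it is an ordinary request parameter, it should also be possible to pass it per request instead of setting it as a default.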
Having used this in anger, I can confirm it does successfully allow document ingestion to continue where Tika parse errors have occurred.

HTH,
Adrian

-----Original Message-----
From: Maciej Liżewski [mailto:[email protected]]
Sent: 10 September 2012 14:48
To: [email protected]
Subject: question about error handling during indexing

Hi,

I have found a situation where Solr throws an exception saying it is not able to parse the specified file, like this:

INFO: [collection1] webapp=/solr path=/update/extract params={literal.deny_token_document=LDAPgroup:DEAD_AUTHORITY&literal.id=file://///XXXXX/YYYYmovie.mov&literal.allow_token_document=LDAPgroup:50071&literal.allow_token_document=LDAPgroup:group} {} 0 269
2012-09-10 15:34:50 org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp4.MP4Parser@48f9a4c1
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)

Now - I can live with that, I do not expect it to index everything, but I am not sure Manifold should react the way it does: it just stops indexing anything more from such a job (and in fact it shuts down job execution), where it should try to index the other pending files...

Now I must run indexing by hand, check if everything is OK, and when there is such a problem, add a proper "exclude" filter (which means Manifold does not index this kind of file at all, although the problem could be with only this one specific file), and run it again.
Still - I have to guarantee that it won't fail in the future on some other file...

Don't you think that Manifold should try to index everything *even* when there are problems with indexing some documents? I am just not sure if this is a bug or a feature... :)
