TikaEntityProcessor onError not working in some cases
-----------------------------------------------------
Key: SOLR-2896
URL: https://issues.apache.org/jira/browse/SOLR-2896
Project: Solr
Issue Type: Bug
Components: contrib - DataImportHandler
Affects Versions: 3.4
Environment: Windows 7, JDK 1.6.0_18, Solr 3.4.0
Reporter: David Webb
When using the TikaEntityProcessor, I have a particular document (attached for
testing) that causes a TikaException. If the onError parameter of the
TikaEntityProcessor is set to "skip" or "continue", the DIH still aborts and
rolls back the entire indexing process.
{code:title=data-config.xml snippet}
<entity name="attach" onError="skip"
query="select filename, filedata from table where id = ${parentEntity.ID}">
<field column="filename" name="filename"/>
<entity dataSource="f2" processor="TikaEntityProcessor" url="filedata"
dataField="attach.FILEDATA" format="text">
<field column="text" name="filedata" />
</entity>
</entity>
{code}
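For comparison, here is a minimal sketch of the same snippet with onError declared on the TikaEntityProcessor entity itself rather than only on the parent "attach" entity. It assumes that DIH evaluates onError separately for each entity. Whether this placement avoids the abort on 3.4 is not verified here.
{code:title=data-config.xml sketch (assumption, untested)}
<entity name="attach" onError="skip"
query="select filename, filedata from table where id = ${parentEntity.ID}">
<field column="filename" name="filename"/>
<!-- onError repeated on the Tika entity; this placement is an assumption -->
<entity dataSource="f2" processor="TikaEntityProcessor" url="filedata"
dataField="attach.FILEDATA" format="text" onError="skip">
<field column="text" name="filedata" />
</entity>
</entity>
{code}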
{code}
Nov 12, 2011 10:22:16 AM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 562
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@8a799a
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
    ... 9 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 29
    at org.apache.poi.hwpf.model.StyleSheet.getCharacterStyle(StyleSheet.java:315)
    at org.apache.poi.hwpf.model.CHPX.getCharacterProperties(CHPX.java:60)
    at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:98)
    at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:797)
    at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:191)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:429)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:419)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:75)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:187)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 11 more
Nov 12, 2011 10:22:16 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Nov 12, 2011 10:22:16 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
{code}