TikaEntityProcessor onError not working in some cases
-----------------------------------------------------

                 Key: SOLR-2896
                 URL: https://issues.apache.org/jira/browse/SOLR-2896
             Project: Solr
          Issue Type: Bug
          Components: contrib - DataImportHandler
    Affects Versions: 3.4
         Environment: Windows 7, JDK 1.6.0_18, Solr 3.4.0
            Reporter: David Webb


When using the TikaEntityProcessor, I have a particular document (attached for 
testing) that causes a TikaException.  If the onError parameter of the 
TikaEntityProcessor is set to "skip" or "continue", the DIH still aborts and 
rolls back the entire indexing process.

{code:title=data-config.xml snippet}
<entity name="attach" onError="skip"
        query="select filename, filedata from table where id = ${parentEntity.ID}">
        <field column="filename" name="filename"/>
        <entity dataSource="f2" processor="TikaEntityProcessor" url="filedata"
                dataField="attach.FILEDATA" format="text">
               <field column="text" name="filedata" />
        </entity>
</entity>
{code}

{code}
Nov 12, 2011 10:22:16 AM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 562
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@8a799a
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
        at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
        ... 9 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 29
        at org.apache.poi.hwpf.model.StyleSheet.getCharacterStyle(StyleSheet.java:315)
        at org.apache.poi.hwpf.model.CHPX.getCharacterProperties(CHPX.java:60)
        at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:98)
        at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:797)
        at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:191)
        at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:429)
        at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:419)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:75)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:187)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        ... 11 more

Nov 12, 2011 10:22:16 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Nov 12, 2011 10:22:16 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
{code}
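
For reference, the behavior this report expects from onError can be sketched 
roughly as below. This is only an illustration, not the DataImportHandler 
source: the class OnErrorSketch, the methods parse and importAll, and the 
sample document names are all hypothetical, and treating "continue" as 
"keep the document with an empty field" is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the expected onError semantics; not Solr code.
public class OnErrorSketch {

    // Stand-in for a parser that fails on one malformed document.
    static String parse(String doc) {
        if (doc.startsWith("bad")) {
            throw new RuntimeException("Unable to read content: " + doc);
        }
        return "text:" + doc;
    }

    // Returns the values that would be indexed under the given onError mode.
    static List<String> importAll(List<String> docs, String onError) {
        List<String> indexed = new ArrayList<String>();
        for (String doc : docs) {
            try {
                indexed.add(parse(doc));
            } catch (RuntimeException e) {
                if ("skip".equals(onError)) {
                    continue;            // drop this document, keep importing
                } else if ("continue".equals(onError)) {
                    indexed.add("");     // assumption: keep the document, field left empty
                } else {
                    throw e;             // "abort": the caller rolls back the import
                }
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a.doc", "bad.doc", "c.doc");
        System.out.println(importAll(docs, "skip"));
        System.out.println(importAll(docs, "continue"));
    }
}
```

In the log above, the import instead behaves like the final branch even 
though onError="skip" is configured.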

