[ 
https://issues.apache.org/jira/browse/TIKA-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting resolved TIKA-665.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0
         Assignee: Jukka Zitting

Fixed in revision 1126568.

The getAddress() call on a HyperlinkRecord was returning null for some link 
within the spreadsheet, so I simply added a null check for that. I'm not sure 
whether this is something that can/should be fixed in POI or whether it's OK 
for the return value to be null.
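
For readers without access to the revision, the kind of null check described above can be sketched roughly as follows. This is only an illustration, not the code committed in revision 1126568: LinkRecord is a hypothetical stand-in for the POI record, and render() is an invented helper that is not part of Tika or POI.

```java
public class NullSafeLinkSketch {

    /** Hypothetical stand-in for a hyperlink record whose address may be null. */
    static final class LinkRecord {
        private final String address;

        LinkRecord(String address) {
            this.address = address;
        }

        String getAddress() {
            return address;
        }
    }

    /**
     * Renders cell text, wrapping it in a link only when an address is
     * actually present. A null address falls back to plain text so that
     * no null attribute value ever reaches the serializer.
     */
    static String render(String text, LinkRecord link) {
        if (link != null && link.getAddress() != null) {
            return "<a href=\"" + link.getAddress() + "\">" + text + "</a>";
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(render("Apache", new LinkRecord("http://apache.org/")));
        System.out.println(render("No target", new LinkRecord(null)));
    }
}
```

The point of the sketch is that the check happens before any attribute is built, which is consistent with the NullPointerException in the report being raised while attributes were written out.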

Note that POI seems to write some extra debug output to System.out when I 
parse this file. It would be nice if that could be avoided.



> NullPointerException from 
> com.sun.org.apache.xml.internal.serializer.ToStream.writeAttrString on some 
> excel files from the CLI
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-665
>                 URL: https://issues.apache.org/jira/browse/TIKA-665
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>         Attachments: hyperlink_excel2001.xls
>
>
> I've discovered that a small number of Excel files (and possibly others, 
> though I haven't noticed any) will cause 
> com.sun.org.apache.xml.internal.serializer.ToStream.writeAttrString to blow 
> up with an NPE. The text being passed through from the Excel parser looks 
> fine, though.
> The full stacktrace when run from the CLI is:
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@bf7916
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:340)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
> Caused by: java.lang.NullPointerException
>       at com.sun.org.apache.xml.internal.serializer.ToStream.writeAttrString(ToStream.java:1966)
>       at com.sun.org.apache.xml.internal.serializer.ToStream.processAttributes(ToStream.java:1946)
>       at com.sun.org.apache.xml.internal.serializer.ToStream.closeStartTag(ToStream.java:2429)
>       at com.sun.org.apache.xml.internal.serializer.ToStream.characters(ToStream.java:1381)
>       at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.characters(TransformerHandlerImpl.java:172)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:167)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>       at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
>       at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
>       at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:113)
>       at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
>       at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
>       at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:287)
>       at org.apache.tika.parser.microsoft.TextCell.render(TextCell.java:35)
>       at org.apache.tika.parser.microsoft.CellDecorator.render(CellDecorator.java:34)
>       at org.apache.tika.parser.microsoft.LinkedCell.render(LinkedCell.java:36)
>       at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processExtraText(ExcelExtractor.java:423)
>       at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processSheet(ExcelExtractor.java:522)
>       at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord(ExcelExtractor.java:346)
>       at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord(ExcelExtractor.java:297)
>       at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord(FormatTrackingHSSFListener.java:82)
>       at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord(HSSFRequest.java:112)
>       at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:147)
>       at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
>       at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:276)
>       at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:136)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:206)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       ... 5 more
> Looking at the Excel parser code, it seems that we're not doing anything 
> wrong, so I think the issue is with the SAX stuff used by the CLI.
