Sam H created TIKA-1947:
---------------------------

             Summary: IllegalArgumentException stacktrace in output since POI update
                 Key: TIKA-1947
                 URL: https://issues.apache.org/jira/browse/TIKA-1947
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.13
            Reporter: Sam H


I tried parsing an Excel document, and noticed there was an 
IllegalArgumentException stacktrace in the output.

I've traced this back to 
https://github.com/apache/tika/commit/25cee54499126de2b90f6bd5bde8de470b422349

My test file is attached.

This is the output when running the 1.13-SNAPSHOT jar:
{code}
java -jar tika-app-1.13-SNAPSHOT.jar iae.xlsx


apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$Ç-2]\ * #,##0.00_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * #,##0.00_)'
        at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
        at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
        at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
        at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
        at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
        at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
        at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)

apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$Ç-2]\ * \(#,##0.00\);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * \(#,##0.00\)'
        at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
        at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
        at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
        at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
        at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
        at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
        at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)

apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$Ç-2]\ * "-"??_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * "-"??_)'
        at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
        at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
        at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
        at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
        at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
        at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
        at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2016-04-11T13:45:08Z"/>
<meta name="extended-properties:AppVersion" content="15.0300"/>
<meta name="dc:creator" content="nick"/>
<meta name="extended-properties:Company" content=""/>
<meta name="dcterms:created" content="2016-01-05T14:53:37Z"/>
<meta name="Last-Modified" content="2016-04-11T13:45:08Z"/>
<meta name="dcterms:modified" content="2016-04-11T13:45:08Z"/>
<meta name="Last-Save-Date" content="2016-04-11T13:45:08Z"/>
<meta name="protected" content="false"/>
<meta name="meta:save-date" content="2016-04-11T13:45:08Z"/>
<meta name="Application-Name" content="Microsoft Excel"/>
<meta name="modified" content="2016-04-11T13:45:08Z"/>
<meta name="Content-Length" content="9119"/>
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<meta name="creator" content="nick"/>
<meta name="meta:author" content="nick"/>
<meta name="meta:creation-date" content="2016-01-05T14:53:37Z"/>
<meta name="extended-properties:Application" content="Microsoft Excel"/>
<meta name="meta:last-author" content="Sam"/>
<meta name="Creation-Date" content="2016-01-05T14:53:37Z"/>
<meta name="resourceName" content="iae.xlsx"/>
<meta name="Last-Author" content="Sam"/>
<meta name="Application-Version" content="15.0300"/>
<meta name="Author" content="nick"/>
<meta name="publisher" content=""/>
<meta name="dc:publisher" content=""/>
<title/>
</head>
<body><div><h1>Sheet1</h1>
<table><tbody><tr>      <td>69.99</td></tr>
</tbody></table>
</div>
</body></html>
{code}

The actual parsed output (the XHTML) is what I would expect, and it matches
the output from version 1.12.

I would expect this exception to be handled some other way, and not to show up
(as text) in my parsed output.
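
As a possible stopgap for anyone embedding Tika in their own JVM, the sketch below raises the java.util.logging level for the POI format package so that WARNING records like the ones above are dropped. It assumes the warnings are emitted through java.util.logging under the org.apache.poi.ss.format namespace; that is inferred from the log layout and class names in the output above, not checked against the POI source. The class QuietFormatWarnings and its methods are made up for illustration and are not part of Tika or POI.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietFormatWarnings {
    // Assumption: POI's cell format classes log under this package name.
    // A strong reference is kept because LogManager only holds loggers weakly,
    // so a level set on an otherwise unreferenced logger could be lost after GC.
    private static final Logger FORMAT_LOGGER =
            Logger.getLogger("org.apache.poi.ss.format");

    /** Drop everything below SEVERE for the assumed POI format namespace. */
    public static void apply() {
        FORMAT_LOGGER.setLevel(Level.SEVERE);
    }

    /** Reports whether a WARNING sent to the named logger would be published. */
    public static boolean wouldLogWarning(String loggerName) {
        return Logger.getLogger(loggerName).isLoggable(Level.WARNING);
    }

    public static void main(String[] args) {
        apply();
        // Child loggers without their own level inherit the parent's level.
        System.out.println(
                wouldLogWarning("org.apache.poi.ss.format.CellFormat"));
    }
}
```

Calling QuietFormatWarnings.apply() once before parsing would hide the stack traces only if the assumption about the logger namespace holds. It does not change how the format string itself is handled, and it does not affect tika-app started with java -jar.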



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)