Sam H created TIKA-1947:
---------------------------
Summary: IllegalArgumentException stacktrace in output since POI update
Key: TIKA-1947
URL: https://issues.apache.org/jira/browse/TIKA-1947
Project: Tika
Issue Type: Bug
Affects Versions: 1.13
Reporter: Sam H
I tried parsing an Excel document, and noticed there was an
IllegalArgumentException stacktrace in the output.
I've traced this back to
https://github.com/apache/tika/commit/25cee54499126de2b90f6bd5bde8de470b422349
My test file is attached.
This is the output when running the 1.13-SNAPSHOT app jar:
{code}
java -jar tika-app-1.13-SNAPSHOT.jar iae.xlsx
apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$Ç-2]\ * #,##0.00_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * #,##0.00_)'
at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)
apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$Ç-2]\ * \(#,##0.00\);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * \(#,##0.00\)'
at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)
apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$Ç-2]\ * "-"??_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * "-"??_)'
at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)
<?xml version="1.0" encoding="UTF-8"?><html
xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2016-04-11T13:45:08Z"/>
<meta name="extended-properties:AppVersion" content="15.0300"/>
<meta name="dc:creator" content="nick"/>
<meta name="extended-properties:Company" content=""/>
<meta name="dcterms:created" content="2016-01-05T14:53:37Z"/>
<meta name="Last-Modified" content="2016-04-11T13:45:08Z"/>
<meta name="dcterms:modified" content="2016-04-11T13:45:08Z"/>
<meta name="Last-Save-Date" content="2016-04-11T13:45:08Z"/>
<meta name="protected" content="false"/>
<meta name="meta:save-date" content="2016-04-11T13:45:08Z"/>
<meta name="Application-Name" content="Microsoft Excel"/>
<meta name="modified" content="2016-04-11T13:45:08Z"/>
<meta name="Content-Length" content="9119"/>
<meta name="Content-Type"
content="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By"
content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<meta name="creator" content="nick"/>
<meta name="meta:author" content="nick"/>
<meta name="meta:creation-date" content="2016-01-05T14:53:37Z"/>
<meta name="extended-properties:Application" content="Microsoft Excel"/>
<meta name="meta:last-author" content="Sam"/>
<meta name="Creation-Date" content="2016-01-05T14:53:37Z"/>
<meta name="resourceName" content="iae.xlsx"/>
<meta name="Last-Author" content="Sam"/>
<meta name="Application-Version" content="15.0300"/>
<meta name="Author" content="nick"/>
<meta name="publisher" content=""/>
<meta name="dc:publisher" content=""/>
<title/>
</head>
<body><div><h1>Sheet1</h1>
<table><tbody><tr> <td>69.99</td></tr>
</tbody></table>
</div>
</body></html>
{code}
The parsed XHTML itself is what I would expect (and matches the output
from version 1.12).
However, I would expect this exception to be handled some other way, and not
to show up as text in my output.
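
As a possible stopgap on the caller's side, here is a minimal sketch of silencing these warnings by raising the log level. It rests on assumptions that are not confirmed in this report: that the warnings are emitted through java.util.logging (the timestamp/WARNING line layout above looks like its default formatter), and that the logger involved lives under the package name org.apache.poi.ss.format. The class name QuietCellFormatWarnings, the CapturingHandler helper, and the simulated log call are all hypothetical and only stand in for what POI does; nothing here touches POI or Tika itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class QuietCellFormatWarnings {
    // Assumption: POI's cell-format classes log somewhere under this package name.
    static final String POI_FORMAT_PACKAGE = "org.apache.poi.ss.format";

    /** Hypothetical helper: collects records instead of printing them. */
    static final class CapturingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();
        @Override public void publish(LogRecord record) {
            if (isLoggable(record)) {
                records.add(record);
            }
        }
        @Override public void flush() { }
        @Override public void close() { }
    }

    /**
     * Logs one simulated "Invalid format" warning from a child logger and
     * returns how many records reached a handler on the package logger.
     */
    static int warningsPublished(boolean silence) {
        // Keep strong references: LogManager may hold loggers only weakly,
        // so a level set on an otherwise unreferenced logger can be lost.
        Logger parent = Logger.getLogger(POI_FORMAT_PACKAGE);
        Logger child = Logger.getLogger(POI_FORMAT_PACKAGE + ".CellFormat");
        CapturingHandler handler = new CapturingHandler();
        Level previousLevel = parent.getLevel();
        boolean previousUseParent = parent.getUseParentHandlers();
        parent.addHandler(handler);
        parent.setUseParentHandlers(false); // keep the demo off the console
        try {
            // The child has no level of its own, so it inherits the parent's.
            parent.setLevel(silence ? Level.SEVERE : Level.ALL);
            child.log(Level.WARNING, "Invalid format: (simulated)",
                    new IllegalArgumentException("simulated"));
            return handler.records.size();
        } finally {
            parent.setLevel(previousLevel);
            parent.setUseParentHandlers(previousUseParent);
            parent.removeHandler(handler);
        }
    }

    public static void main(String[] args) {
        System.out.println(warningsPublished(false)); // prints 1
        System.out.println(warningsPublished(true));  // prints 0
    }
}
```

In an application embedding Tika, the equivalent would be a single setLevel(Level.SEVERE) call on a logger for that package, held in a static field, before parsing. This only hides the message; it does not change how the unsupported format block is handled.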