[ https://issues.apache.org/jira/browse/TIKA-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yahav Amsalem updated TIKA-2934:
--------------------------------
    Attachment: sample.xlsx

> OOXML parser fails to parse XLSX files with missing cellRef properties
> ----------------------------------------------------------------------
>
>                 Key: TIKA-2934
>                 URL: https://issues.apache.org/jira/browse/TIKA-2934
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.22
>            Reporter: Yahav Amsalem
>            Priority: Major
>         Attachments: sample.xlsx
>
>
> A NullPointerException is thrown when parsing xlsx documents whose cells
> don’t have a CellRef property:
> {code:java}
> Caused by: java.lang.NullPointerException: null
>     at org.apache.poi.util.StringUtil.endsWithIgnoreCase(StringUtil.java:317)
>     at org.apache.poi.ss.util.CellReference.<init>(CellReference.java:109)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$SheetTextAsHTML.cell(XSSFExcelExtractorDecorator.java:452)
>     at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:379)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:553)
>     at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
>     at org.apache.tika.utils.XMLReaderUtils.parseSAX(XMLReaderUtils.java:452)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:352)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:168)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:136)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:122)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:201)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> {code}
> According to the latest OOXML standard, ECMA-376 5th edition Part 1 (released
> in December 2016), the Cell Reference (18.18.7, ST_CellRef) property on a Cell
> (18.3.1.4, CT_Cell) is optional.
> We believe an abandoned pull request was meant to fix this issue but was
> never merged:
> [https://github.com/apache/tika/pull/214/commits/d79aa3baf33d4f859e4daa8ef251721f3ac2a198].
> Look at the safety block commented with:
> {code:java}
> // gracefully handle missing CellRef here in a similar way as XSSFCell does
> {code}
>
>

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
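
For the missing CellRef case described in this issue, a minimal sketch of one possible guard follows. It is not the code from the pull request mentioned above and it does not use the POI or Tika APIs; the class `CellRefFallback` and its methods are hypothetical names introduced only for illustration. The sketch assumes the caller tracks the zero-based row and column position of the current cell, and it synthesizes an A1-style reference from that position whenever the cell's reference is null or empty, instead of passing null on to code that expects a string.

```java
// Hypothetical helper, not part of POI or Tika: illustrates a null-safe
// fallback for a cell reference that the spreadsheet XML may omit.
final class CellRefFallback {

    private CellRefFallback() {
    }

    // Converts a zero-based column index to spreadsheet letters:
    // 0 -> "A", 25 -> "Z", 26 -> "AA".
    static String columnLetters(int colIndex) {
        StringBuilder sb = new StringBuilder();
        for (int n = colIndex + 1; n > 0; n = (n - 1) / 26) {
            sb.insert(0, (char) ('A' + (n - 1) % 26));
        }
        return sb.toString();
    }

    // Returns the reference read from the file when present; otherwise
    // builds one from the tracked zero-based row and column position.
    static String resolveCellRef(String cellRef, int rowIndex, int colIndex) {
        if (cellRef != null && !cellRef.isEmpty()) {
            return cellRef;
        }
        return columnLetters(colIndex) + (rowIndex + 1);
    }
}
```

With such a guard, a call like `CellRefFallback.resolveCellRef(null, 0, 2)` yields `"C1"`, while a reference that is present in the file is returned unchanged. Whether the tracked position is reliable depends on how the handler counts cells in each row, which this sketch leaves to the caller.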