[ https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389884#comment-16389884 ]
Radim Rehurek edited comment on TIKA-1020 at 3/7/18 5:57 PM:
-------------------------------------------------------------

We just hit this bug too. I say "bug" because Excel spreadsheets are really structured tables, just like [~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes little sense.

[~tpalsulich] IMO empty rows could be reported too, but in our use-case, the critical thing is not to have jumbled records caused by empty cells in a single row.


was (Author: piskvorky):
We just hit this bug too. I say "bug" because Excel spreadsheets are really structured tables, just like [~arodkin] explained 5 years ago. Interpreting them as unorganized cells makes little sense.

[~tpalsulich] IMO empty rows could be reported too, but in our use-case, the critical thing is not to have jumbled records (caused by missing cells in a single row).

> Excel 2010 parser missing cell values are not reported resulting in missing
> columns values
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1020
>                 URL: https://issues.apache.org/jira/browse/TIKA-1020
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>         Environment: java 1.6 & 1.7
>            Reporter: Neil Blue
>            Priority: Major
>              Labels: newbie, patch
>
> When parsing an Excel 2010 table, if a worksheet has a missing value, then it
> is not reported in the sax handler. As a result a missing value can result in
> unordered data.
> For example given the table:
> {code:title=Bar.java|borderStyle=solid}
> A B C
> 1 2 3
> 4   6
> 7 8 9
> {code}
> the returned sax handler reports elements
> {code:title=Bar.java|borderStyle=solid}
> <tr><td>A</td><td>B</td><td>C</td></tr>
> <tr><td>1</td><td>2</td><td>3</td></tr>
> <tr><td>4</td><td>6</td></tr>
> <tr><td>7</td><td>8</td><td>9</td></tr>
> {code}
> As a result the handler can detect that the third row has incomplete cell
> values but it is ambiguous which columns have missing data.
> As a possible fix for this, Excel 2010 XML data contains the cell reference
> value, which could be returned to the sax handler as an attribute.
> {code:title=Bar.java|borderStyle=solid}
> *** XSSFExcelExtractorDecorator.java	2012-11-08 10:51:55.881207100 +0000
> --- XSSFExcelExtractorDecorator.java.1	2012-11-08 10:59:02.972223700 +0000
> ***************
> *** 200,206 ****
>   
>           public void cell(String cellRef, String formattedValue) {
>              try {
> !                xhtml.startElement("td");
>   
>                  // Main cell contents
>                  xhtml.characters(formattedValue);
> --- 200,208 ----
>   
>           public void cell(String cellRef, String formattedValue) {
>              try {
> !                AttributesImpl attributes = new AttributesImpl();
> !                attributes.addAttribute(null, "cellRef", "cellRef", null, cellRef);
> !                xhtml.startElement("td",attributes);
>   
>                  // Main cell contents
>                  xhtml.characters(formattedValue);
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
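
For illustration only, here is a minimal sketch of how a consumer might use a cellRef attribute like the one proposed in the patch to put values back into their columns. It assumes A1-style, upper-case references such as "C3". The class CellRefSketch and its methods columnIndex, cellAttributes and alignRow are hypothetical names made up for this example and are not part of Tika; only org.xml.sax.helpers.AttributesImpl is a real JDK class. Unlike the patch, the sketch passes an empty namespace URI and the "CDATA" type, which is the usual SAX convention rather than something the issue specifies.

```java
import java.util.Arrays;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of Tika: shows what a handler could do
// once each <td> carries a cellRef attribute.
public class CellRefSketch {

    /** Zero-based column index of an A1-style reference ("C3" -> 2, "AA10" -> 26). */
    public static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = cellRef.charAt(i);
            if (c < 'A' || c > 'Z') {
                break; // reached the row digits
            }
            col = col * 26 + (c - 'A' + 1);
        }
        return col - 1;
    }

    /** Attribute set along the lines of the patch (assumed URI "" and type "CDATA"). */
    public static AttributesImpl cellAttributes(String cellRef) {
        AttributesImpl attributes = new AttributesImpl();
        attributes.addAttribute("", "cellRef", "cellRef", "CDATA", cellRef);
        return attributes;
    }

    /** Places each reported value in its column; unreported cells stay empty. */
    public static String[] alignRow(String[] cellRefs, String[] values, int width) {
        String[] row = new String[width];
        Arrays.fill(row, "");
        for (int i = 0; i < cellRefs.length; i++) {
            row[columnIndex(cellRefs[i])] = values[i];
        }
        return row;
    }

    public static void main(String[] args) {
        // Third row of the example table: only A3 and C3 are reported.
        String[] row = alignRow(new String[] {"A3", "C3"}, new String[] {"4", "6"}, 3);
        System.out.println(String.join("|", row));                    // prints 4||6
        System.out.println(cellAttributes("C3").getValue("cellRef")); // prints C3
    }
}
```

With the reference available, the row "4, 6" is no longer ambiguous: the 6 lands in column C and column B is left empty.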