[ 
https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494783#comment-13494783
 ] 

Nick Burch commented on TIKA-1020:
----------------------------------

The current Tika behaviour is what I'd expect: you're getting text for the 
cells with real values, and the output isn't cluttered with the missing 
cells/rows (of which there can be huge numbers in many Excel files). I'm not 
sure we want to be putting cell references, blank cells etc. into the HTML.

If you have specific requirements in this area, e.g. you actually want to 
generate things like CSV files, then you're best off using Apache POI 
directly, which provides optional ways to detect these missing cells / rows 
and lets you put in your own logic to handle them as your needs dictate.
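
As a rough illustration of the kind of "own logic" meant here, the sketch 
below pads a sparse row back to full width using A1-style cell references 
(such as the cellRef argument in the callback quoted below). It uses only the 
JDK; the class and method names (SparseRowPadder, columnIndex, toCsvLine) are 
made up for this example and are not part of Tika or POI, and it assumes the 
cells of one row arrive as a map from cell reference to formatted value.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika or POI: rebuilds a full-width row
// from the cells that were actually reported, keyed by cell reference.
final class SparseRowPadder {

    // Converts the column letters of an A1-style reference to a zero-based
    // index, e.g. "A3" -> 0, "C3" -> 2, "AA10" -> 26. Returns -1 if the
    // reference does not start with a letter.
    public static int columnIndex(String cellRef) {
        int index = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char ch = Character.toUpperCase(cellRef.charAt(i));
            if (ch < 'A' || ch > 'Z') {
                break; // reached the row-number part of the reference
            }
            index = index * 26 + (ch - 'A' + 1);
        }
        return index - 1;
    }

    // Builds one comma-separated line of the given width, leaving an empty
    // field for every column that has no reported cell. No CSV quoting or
    // escaping is done; a real exporter would need that.
    public static String toCsvLine(Map<String, String> cells, int width) {
        String[] padded = new String[width];
        Arrays.fill(padded, "");
        for (Map.Entry<String, String> entry : cells.entrySet()) {
            int col = columnIndex(entry.getKey());
            if (col >= 0 && col < width) {
                padded[col] = entry.getValue();
            }
        }
        return String.join(",", padded);
    }

    public static void main(String[] args) {
        // The third row of the example table in the issue: B3 is missing.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("A3", "4");
        row.put("C3", "6");
        System.out.println(toCsvLine(row, 3)); // prints "4,,6"
    }
}
```

Note that String.join needs Java 8 or later; on the Java 6/7 environments 
mentioned in the issue, a StringBuilder loop would do the same job.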
                
> Excel 2010 parser missing cell values are not reported resulting in missing 
> columns values
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1020
>                 URL: https://issues.apache.org/jira/browse/TIKA-1020
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>         Environment: java 1.6 & 1.7 
>            Reporter: Neil Blue
>              Labels: newbie, patch
>
> When parsing an Excel 2010 table, if a worksheet has a missing value, then it 
> is not reported to the SAX handler. As a result, a missing value can leave 
> the remaining values in a row under the wrong columns.
> For example given the table:
> {code:title=Bar.java|borderStyle=solid}
> A B C
> 1 2 3
> 4   6
> 7 8 9
> {code}
> the SAX handler receives the elements
> {code:title=Bar.java|borderStyle=solid}
> <tr><td>A</td><td>B</td><td>C</td></tr>
> <tr><td>1</td><td>2</td><td>3</td></tr>
> <tr><td>4</td><td>6</td></tr>
> <tr><td>7</td><td>8</td><td>9</td></tr>
> {code}
> As a result, the handler can detect that the third row has incomplete cell 
> values, but it is ambiguous which columns have missing data.
> As a possible fix, the Excel 2010 XML data contains the cell reference 
> value, which could be passed to the SAX handler as an attribute. 
> {code:title=Bar.java|borderStyle=solid}
> *** XSSFExcelExtractorDecorator.java    2012-11-08 10:51:55.881207100 +0000
> --- XSSFExcelExtractorDecorator.java.1  2012-11-08 10:59:02.972223700 +0000
> ***************
> *** 200,206 ****
>   
>          public void cell(String cellRef, String formattedValue) {
>             try {
> !              xhtml.startElement("td");
>   
>                // Main cell contents
>                xhtml.characters(formattedValue);
> --- 200,208 ----
>   
>          public void cell(String cellRef, String formattedValue) {
>             try {
> !              AttributesImpl attributes = new AttributesImpl();
> !              attributes.addAttribute(null, "cellRef", "cellRef", null, cellRef);
> !              xhtml.startElement("td",attributes);
>   
>                // Main cell contents
>                xhtml.characters(formattedValue);
> {code} 

