[ https://issues.apache.org/jira/browse/TIKA-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Louic Vermeer updated TIKA-3032:
--------------------------------
    Priority: Minor  (was: Critical)

> Table cells below a colspan property are shifted
> ------------------------------------------------
>
>                 Key: TIKA-3032
>                 URL: https://issues.apache.org/jira/browse/TIKA-3032
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.23
>         Environment: Linux neon 5.3.18-1-MANJARO #1 SMP PREEMPT Wed Dec 18 
> 18:34:35 UTC 2019 x86_64 GNU/Linux
> openjdk 13.0.2 2020-01-14
> OpenJDK Runtime Environment (build 13.0.2+8)
> OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode)
>            Reporter: Louic Vermeer
>            Priority: Minor
>         Attachments: table.html
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When a colspan attribute is used in HTML or XML input, cells in the rows 
> below the spanning cell are shifted to the left. As a result, it is no longer 
> possible to reconstruct which column each value belongs to after parsing.
> In the attached example, the labels are no longer above the correct column. 
> This example was inspired by the tables in the SEC filings XBRL data. See for 
> example the following link (22MB!) to a 10-K filing: 
> https://www.sec.gov/Archives/edgar/data/1410636/000141063619000041/0001410636-19-000041.txt
> Suggested solution:
> Tika could insert empty cells after the cell with the colspan. While this 
> may not be perfect, it would at least prevent the cells that follow from 
> shifting position and ending up in the wrong column. The ideal solution (for 
> me at least) would be to preserve the colspan information in XML output and 
> to insert extra tabs in TXT output to keep the columns aligned.
>  
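
The padding idea in the suggested solution above can be sketched outside
Tika. The following is a minimal illustration in Python using only the
standard library's html.parser; it is not Tika code and not a proposed patch.
The names ColspanPaddingParser and table_to_tsv are hypothetical, and the
sketch assumes every cell is explicitly closed and ignores rowspan.

```python
from html.parser import HTMLParser


class ColspanPaddingParser(HTMLParser):
    """Collect table rows, padding each colspan cell with empty cells.

    Hypothetical helper for illustration only; not part of Tika.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._text = None   # text fragments of the current cell, or None
        self._span = 1      # colspan of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._text = []
            value = dict(attrs).get("colspan") or "1"
            # Fall back to a span of 1 for missing or malformed values.
            self._span = max(1, int(value)) if value.isdigit() else 1

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._text is not None and self._row is not None:
                self._row.append("".join(self._text).strip())
                # Empty placeholders keep later cells in their columns.
                self._row.extend([""] * (self._span - 1))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_tsv(html):
    """Return the table cells as tab-separated text, one row per line."""
    parser = ColspanPaddingParser()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)
```

For a made-up table whose header row is a two-column "Revenue" cell followed
by "Cost", and whose data row holds three cells, the header comes out as
"Revenue", an empty cell, then "Cost", so "Cost" stays above the third data
column. Preserving the colspan value itself in XML output, as the description
prefers, would need a different approach.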



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
