[
https://issues.apache.org/jira/browse/TIKA-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Louic Vermeer updated TIKA-3032:
--------------------------------
Priority: Minor (was: Critical)
> Table cells below a colspan property are shifted
> ------------------------------------------------
>
> Key: TIKA-3032
> URL: https://issues.apache.org/jira/browse/TIKA-3032
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.23
> Environment: Linux neon 5.3.18-1-MANJARO #1 SMP PREEMPT Wed Dec 18
> 18:34:35 UTC 2019 x86_64 GNU/Linux
> openjdk 13.0.2 2020-01-14
> OpenJDK Runtime Environment (build 13.0.2+8)
> OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode)
> Reporter: Louic Vermeer
> Priority: Minor
> Attachments: table.html
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> When a colspan property is used in HTML or XML input, cells in the rows below
> the colspan are shifted to the left. It is therefore no longer possible to
> reconstruct which column the values belong to after parsing.
> In the attached example, the labels are no longer above the correct column.
> This example was inspired by the tables in the XBRL data of SEC filings. See,
> for example, the following link (22MB!) to a 10-K filing:
> https://www.sec.gov/Archives/edgar/data/1410636/000141063619000041/0001410636-19-000041.txt
> Suggested solution:
> Tika could insert empty cells after the cell with the colspan. While this
> may not be perfect, at least it would prevent the cells after it from shifting
> position and ending up in the wrong column. The ideal solution (for me at
> least) would be to preserve the colspan information in XML output and to
> insert extra tabs in TXT output to keep the columns aligned.
>
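The suggested padding can be sketched outside Tika as follows. This is only an illustration of the idea, not Tika's parser code or API: it uses Python's standard html.parser, the class name PaddedTableParser and the helper to_tsv are hypothetical, and SAMPLE is a made-up table (it is not the attached table.html). Each cell with colspan=N is followed by N-1 empty cells, so every row ends up with the same number of columns.

```python
from html.parser import HTMLParser

# Hypothetical sample input: the first header cell spans two columns.
SAMPLE = """
<table>
<tr><th colspan="2">2019</th><th>2018</th></tr>
<tr><td>10</td><td>20</td><td>30</td></tr>
</table>
"""


class PaddedTableParser(HTMLParser):
    """Collect table rows, adding empty cells after each colspan cell."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._text = None   # text fragments of the current cell, or None
        self._span = 1      # colspan of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._text = []
            value = dict(attrs).get("colspan") or "1"
            # Fall back to 1 for missing, non-numeric, or zero values.
            self._span = max(int(value), 1) if value.isdigit() else 1

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None and self._row is not None:
            self._row.append("".join(self._text).strip())
            # Pad with (colspan - 1) empty cells so later cells keep their column.
            self._row.extend([""] * (self._span - 1))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._text = None


def to_tsv(html):
    """Return the table as tab-separated text, one line per row."""
    parser = PaddedTableParser()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)


if __name__ == "__main__":
    print(to_tsv(SAMPLE))
```

In the tab-separated output the header row carries an extra tab after "2019", so "2018" stays above the third column. The sketch does not handle rowspan or nested tables.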
--
This message was sent by Atlassian Jira
(v8.3.4#803005)