[ https://issues.apache.org/jira/browse/TIKA-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Louic Vermeer updated TIKA-3032:
--------------------------------
Description:
When a colspan attribute is used in HTML or XML input, cells to the right of
the spanning cell are shifted to the left in the output. The structure of the
table is therefore compromised, and it is no longer possible to reconstruct
which cells belong to which column.
In the attached example, the labels are no longer above the correct column.
This example was inspired by the tables in the XBRL data of SEC filings. See,
for example, the following link (22MB!) to a 10-K filing:
[https://www.sec.gov/Archives/edgar/data/1410636/000141063619000041/0001410636-19-000041.txt]
Suggested solution:
Tika could insert empty cells after the cell with the colspan. While this may
not be perfect, it would at least prevent the cells after it from shifting
position and ending up in the wrong column. The ideal solution (for me at
least) would be to preserve the colspan information in XML output and to insert
extra tabs in TXT output to keep the columns aligned.
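To illustrate the padding idea, here is a minimal sketch in Python. It is not
Tika code and does not use any Tika API: it relies only on the standard
library's html.parser, and the names ColspanTableParser, table_to_tsv and
SAMPLE are made up for this example. The sample table is also invented, since
the attached table.html is not reproduced here. The sketch assumes well-formed
markup with explicit closing tags and ignores rowspan and nested tables.

```python
from html.parser import HTMLParser


class ColspanTableParser(HTMLParser):
    """Hypothetical helper: collects table rows and pads spanning cells."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._text = None  # text pieces of the open cell, or None
        self._span = 1     # colspan of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            value = (dict(attrs).get("colspan") or "1").strip()
            # Fall back to a span of 1 for missing or malformed values.
            self._span = max(int(value), 1) if value.isdigit() else 1
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            self._row.append("".join(self._text).strip())
            # The suggested fix: one empty cell per extra column spanned.
            self._row.extend([""] * (self._span - 1))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._text = None


def table_to_tsv(html):
    """Render every parsed row as one tab-separated line of text."""
    parser = ColspanTableParser()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)


# Invented sample: the first header cell spans two columns.
SAMPLE = """
<table>
  <tr><th colspan="2">Group A</th><th>Total</th></tr>
  <tr><td>1</td><td>2</td><td>3</td></tr>
</table>
"""

print(table_to_tsv(SAMPLE))
```

With the padding in place, "Total" stays in the third column of the
tab-separated output, directly above the value 3, instead of sliding into the
second column.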
was:
When a colspan property is used in html or xml input, cells in the rows below
the colspan are shifted to the left. Therefore it is no longer possible to
reconstruct which column the values belong to after parsing.
In the attached example, the labels are no longer above the correct column.
This example was inspired by the tables in the sec filings XBRL data. See for
example the following link (22MB!) to a 10-K filing:
https://www.sec.gov/Archives/edgar/data/1410636/000141063619000041/0001410636-19-000041.txt
Suggested solution:
Tika could insert empty cells behind the cell with the colspan. While this may
not be perfect, at least it would prevent cells after it from shifting position
and ending up in the wrong column. The ideal solution (for me at least) would
be to preserve the colspan information in XML output and to insert extra tabs
in TXT output to keep the columns aligned.
> Table cells below a colspan property are shifted
> ------------------------------------------------
>
> Key: TIKA-3032
> URL: https://issues.apache.org/jira/browse/TIKA-3032
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.23
> Environment: Linux neon 5.3.18-1-MANJARO #1 SMP PREEMPT Wed Dec 18
> 18:34:35 UTC 2019 x86_64 GNU/Linux
> openjdk 13.0.2 2020-01-14
> OpenJDK Runtime Environment (build 13.0.2+8)
> OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode)
> Reporter: Louic Vermeer
> Priority: Minor
> Attachments: table.html
>
> Original Estimate: 168h
> Remaining Estimate: 168h