[ https://issues.apache.org/jira/browse/TIKA-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834681#comment-15834681 ]
Tim Allison commented on TIKA-2249:
-----------------------------------
bq. So when Tika claims to parse pdf to HTML and in the resultant HTML if
tables are not preserved i.e. no table tags then isn't it wrong?
This hinges on "tables are not preserved." There are no tables in the
underlying PDF document to be preserved. There are instructions for where to
draw lines and where to place text on the page. Humans can very easily see
tables. It takes computation that Tika doesn't currently apply to "guess"
where the tables are and what the rows/columns contain.
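To illustrate the kind of "guessing" involved, here is a minimal sketch in
Python. It is not what Tika or any table-extraction package actually does; it
only assumes that a PDF text extractor hands us text fragments with page
coordinates. The function name guess_table, the (x, y, text) tuple layout,
the tolerance values, and the sample data are all hypothetical, and y is
assumed to increase down the page.

```python
def guess_table(fragments, row_tol=2.0, col_tol=10.0):
    """Group (x, y, text) fragments into rows/columns by coordinate proximity.

    Crude heuristic for illustration only: real tables with merged cells,
    wrapped text, or ragged alignment will defeat it.
    """
    # Cluster x positions: an x within col_tol of the previous x joins its column.
    columns = []
    for x in sorted({x for x, _, _ in fragments}):
        if not columns or x - columns[-1][-1] > col_tol:
            columns.append([x])
        else:
            columns[-1].append(x)

    def col_index(x):
        # Every x was used to build the clusters, so a match always exists.
        for i, cluster in enumerate(columns):
            if cluster[0] <= x <= cluster[-1]:
                return i

    # Walk fragments top to bottom; start a new row when y moves past row_tol.
    rows = []
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if not rows or abs(y - rows[-1][0]) > row_tol:
            rows.append((y, [""] * len(columns)))
        cells = rows[-1][1]
        i = col_index(x)
        cells[i] = (cells[i] + " " + text).strip()
    return [cells for _, cells in rows]


# Hypothetical fragments, as if read from a page with a 3x2 grid of text.
fragments = [
    (50, 100, "Name"), (200, 100.5, "Qty"),
    (50, 120, "Apple"), (201, 120, "3"),
    (50, 140, "Pear"), (199, 140.4, "12"),
]
print(guess_table(fragments))
```

Even in this toy case the result depends entirely on the chosen tolerances,
which is why this has to be called guessing rather than preserving.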
We should update our documentation to acknowledge this limitation, and if you
can recommend an Apache 2.0-friendly package that extracts tables, we can work
towards integrating that functionality (perhaps grobid+Tabula?).
MS Word documents and RTF documents, by contrast, store tables as objects that
we can process programmatically, which lets us "preserve" the structure of
those tables.
> Tika not able to parse tables from pdf
> ---------------------------------------
>
> Key: TIKA-2249
> URL: https://issues.apache.org/jira/browse/TIKA-2249
> Project: Tika
> Issue Type: Bug
> Components: handler
> Reporter: Amit Kumar
> Attachments: Japanese.pdf
>
>
> Tika not able to parse tables from pdf. I want to attach sample pdf which I
> tried but attachment/browse link is not visible to me.