[jira] [Commented] (TIKA-2249) Tika not able to parse tables from pdf

Tim Allison (JIRA) Mon, 23 Jan 2017 06:43:01 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834681#comment-15834681
 ]


Tim Allison commented on TIKA-2249:
-----------------------------------

bq. So when Tika claims to parse pdf to HTML and in the resultant HTML if 
tables are not preserved i.e. no table tags then isn't it wrong?

This hinges on "tables are not preserved." There are no tables in the 
underlying PDF document to be preserved. The PDF contains only instructions 
for where to draw lines and where to place text on the page. Humans can very 
easily see tables in the rendered result, but it takes computation that Tika 
doesn't currently apply to "guess" where the tables are and what the 
rows/columns contain.
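
To make the kind of "guessing" concrete, here is a minimal sketch of the 
simplest possible heuristic. It is not Tika or PDFBox code: the TableGuess 
class, the Fragment type and the yTolerance parameter are all hypothetical 
names made up for illustration. It assumes we already have each piece of text 
together with the x/y position it was drawn at, and that y grows down the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TableGuess {

    /** Hypothetical holder: a piece of text and the page position it was drawn at. */
    public static final class Fragment {
        final double x;
        final double y;
        final String text;

        public Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Guess rows by grouping fragments whose y positions lie within
     * yTolerance of the first fragment in the row, then order each row
     * left to right by x. Assumes y grows down the page.
     */
    public static List<List<String>> guessRows(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        double rowY = 0;
        for (Fragment f : sorted) {
            // Start a new row when this fragment is too far below the current one.
            if (rows.isEmpty() || Math.abs(f.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Fragments arrive in drawing order, which need not be reading order.
        List<Fragment> page = List.of(
                new Fragment(200, 100.4, "Qty"),
                new Fragment(50, 100.0, "Item"),
                new Fragment(50, 120.0, "Apples"),
                new Fragment(200, 119.8, "3"));
        for (List<String> row : guessRows(page, 2.0)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Even this toy version has to pick a tolerance, and it says nothing about 
columns, wrapped cell text, merged cells, or ordinary paragraphs that happen to 
line up. That is the gap between "text placed on a page" and "a table".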

We should update our documentation to acknowledge this limitation, and if you 
can recommend an Apache 2.0-friendly package that extracts tables, we can work 
towards integrating that functionality (perhaps grobid+Tabula?).

MSWord documents and RTF documents, by contrast, store tables as objects that 
we can process programmatically, so we can "preserve" the structure of those 
tables.

> Tika not able to parse tables from pdf 
> ---------------------------------------
>
>                 Key: TIKA-2249
>                 URL: https://issues.apache.org/jira/browse/TIKA-2249
>             Project: Tika
>          Issue Type: Bug
>          Components: handler
>            Reporter: Amit Kumar
>         Attachments: Japanese.pdf
>
>
> Tika not able to parse tables from pdf. I want to attach sample pdf which I 
> tried but attachment/browse link is not visible to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
