[ https://issues.apache.org/jira/browse/TIKA-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2646.
-------------------------------
Resolution: Won't Fix
[~adidier] thank you for opening this issue and sharing this with us. PDFs
don't store table structures per se (as MSWord/PPT do); rather, they store
text at coordinates on a page. Tables have to be inferred/reconstructed from
those coordinates. Neither Apache Tika nor Apache PDFBox currently infers or
reconstructs tables.
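To illustrate what "inferring tables from coordinates" means, here is a
minimal sketch in plain Python. It is not how Tika, PDFBox, or tabula work
internally. The function name, the (x, y, text) fragment layout, the
coordinate values, and the y_tolerance parameter are all assumptions made up
for this example. It assumes y grows downward and that each cell fits on one
line, so it would not handle the multi-line cells described in this issue.

```python
def rows_from_fragments(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows by y, then order each row by x.

    Hypothetical helper for illustration only. A fragment joins the current
    row when its y is within y_tolerance of that row's first fragment.
    """
    rows = []  # each entry: (anchor_y, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, left-to-right order is recovered by sorting on x.
    return [
        [text for _, text in sorted(cells, key=lambda c: c[0])]
        for _, cells in rows
    ]


# Made-up coordinates; the fragments arrive in a jumbled order,
# as they can in a PDF content stream.
fragments = [
    (120.0, 50.0, "DUR (hr)"),
    (40.0, 50.5, "HOURS"),
    (120.0, 70.0, "17.00"),
    (200.0, 50.0, "PHASE"),
    (40.0, 70.2, "00:00 - 17:00"),
    (200.0, 69.8, "FLOWBK"),
]

for row in rows_from_fragments(fragments):
    print(" | ".join(row))
```

Even this toy version has to guess a tolerance for what counts as "the same
row", and real tables add wrapped cells, merged cells, and missing ruling
lines on top of that, which is why a dedicated tool is the better route.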
You might want to look into https://github.com/tabulapdf/tabula-java (which
uses PDFBox) to extract tables.
If you'd like to reopen this issue and request that we integrate tabula into
Tika, please do so. I'm not sure I'd have the time to do it any time soon, but
someone else may.
> Tika parse["content"] returns jumbled text across cells of a table in a pdf
> ---------------------------------------------------------------------------
>
> Key: TIKA-2646
> URL: https://issues.apache.org/jira/browse/TIKA-2646
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.18
> Environment: MacOS Sierra 10.12.6
> Reporter: Annie Didier
> Priority: Trivial
> Labels: performance
>
> When text from a table is extracted, sometimes the order of the cells becomes
> mixed and the words get concatenated together. For example:
>
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||DESCRIPTION||
> becomes: Hours Dur Code Sub DescriptionPhase
>
> In other, more serious cases, the text within a cell becomes scrambled with
> text from another cell, such as:
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||
> |00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK /
> TESTING|E - RIG OUT
> TESTERS|
> the second row becomes:
> 17.00-00:00 17:00 FLOWBK E - RIG OUT
>
> TESTERS
>
> 33 P -
>
> FLOWBACK /
>
> TESTING
> Note that the value of the second column has moved to the first column, and
> the "-" within the first column is misordered. The last two columns have
> switched places.