Annie Didier created TIKA-2646:
----------------------------------
Summary: Tika parse["content"] returns jumbled text across cells
of a table in a pdf
Key: TIKA-2646
URL: https://issues.apache.org/jira/browse/TIKA-2646
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.18
Environment: MacOS Sierra 10.12.6
Reporter: Annie Didier
When text is extracted from a table in a PDF, the order of the cells is
sometimes mixed up and words from different cells are concatenated. For example:
||HOURS||DUR (hr)||PHASE||CODE||SUB||DESCRIPTION||
becomes: Hours Dur Code Sub DescriptionPhase
In other, more serious cases, the text within one cell is scrambled with
text from another cell. For example:
||HOURS||DUR (hr)||PHASE||CODE||SUB||
|00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK / TESTING|E - RIG OUT TESTERS|
The second row becomes:
17.00-00:00 17:00 FLOWBK E - RIG OUT
TESTERS
33 P -
FLOWBACK /
TESTING
Note that the value of the second column (17.00) has moved ahead of the first
column, and the "-" from the first column is misplaced. The last two columns
have switched places.
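
The misordering in the second example can be checked mechanically. The following is a minimal Python sketch, not part of Tika: the cell values and the extracted text are copied from the example above, and the helper names (normalize, char_counts) are made up for illustration. It confirms that no characters are lost, only reordered, and reports where each original cell ends up in the extracted text.

```python
from collections import Counter

# Row as it appears in the PDF table (cells copied from the report).
expected_cells = [
    "00:00 - 17:00",
    "17.00",
    "FLOWBK",
    "33 P - FLOWBACK / TESTING",
    "E - RIG OUT TESTERS",
]

# Text reported as extracted for that row.
extracted = """17.00-00:00 17:00 FLOWBK E - RIG OUT
TESTERS
33 P -
FLOWBACK /
TESTING"""


def normalize(text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


def char_counts(text):
    """Count non-whitespace characters, ignoring their order."""
    return Counter("".join(text.split()))


flat = normalize(extracted)

# True when the same characters are present and only their order differs.
same_characters = char_counts(" ".join(expected_cells)) == char_counts(extracted)

# Position of each cell in the extracted text, or -1 if the cell was broken up.
positions = [flat.find(normalize(cell)) for cell in expected_cells]

print(same_characters)
print(positions)
```

With this data, the first cell is not found intact (its position is -1) and the fourth cell is located after the fifth, which matches the description above: the content survives extraction, but the reading order does not.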
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)