Annie Didier created TIKA-2646:
----------------------------------

             Summary: Tika parse["content"] returns jumbled text across cells of a table in a pdf
                 Key: TIKA-2646
                 URL: https://issues.apache.org/jira/browse/TIKA-2646
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.18
         Environment: MacOS Sierra 10.12.6
            Reporter: Annie Didier


When text is extracted from a table in a PDF, the order of the cells is 
sometimes mixed up and words from adjacent cells are concatenated. For example:

 
||HOURS||DUR (hr)||PHASE||CODE||SUB||DESCRIPTION||

becomes: Hours Dur Code Sub DescriptionPhase

 

In other, more serious cases, the text within a cell becomes scrambled with 
text from another cell. For example:
||HOURS||DUR (hr)||PHASE||CODE||SUB||
|00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK / TESTING|E - RIG OUT TESTERS|

The second row becomes:

17.00-00:00 17:00 FLOWBK E - RIG OUT

 

TESTERS

 

33 P -

 

FLOWBACK /

 

TESTING

Note that the value of the second column (17.00) has moved in front of the 
first column, and the "-" from the first column is misplaced: it now precedes 
00:00 instead of separating 00:00 and 17:00. The last two columns have switched 
places.
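
A minimal sketch of how the reordering in the first example could be checked 
automatically. The helper name observed_order is hypothetical and not part of 
Tika. The extracted string is hard-coded from the example above; assuming the 
tika-python client is in use (as the parse["content"] in the summary suggests), 
it would normally come from parser.from_file(path)["content"].

```python
def observed_order(expected_cells, extracted_text):
    """Return the expected cells ordered by where each first appears in the text.

    Matching is a case-insensitive substring search, so it is only a rough
    check: a cell label that also occurs inside another word would match there.
    Cells that are not found at all are left out of the result.
    """
    text = extracted_text.lower()
    positions = {}
    for cell in expected_cells:
        idx = text.find(cell.lower())
        if idx != -1:
            positions[cell] = idx
    return sorted(positions, key=positions.get)


# Header cells in the order they appear in the PDF table (from this report).
expected = ["HOURS", "DUR", "PHASE", "CODE", "SUB", "DESCRIPTION"]

# Text as reported in the first example; in practice this would be the
# extracted content string rather than a literal.
extracted = "Hours Dur Code Sub DescriptionPhase"

order = observed_order(expected, extracted)
print(order)
print("order preserved:", order == expected)
```

For the header row above, the check reports PHASE after DESCRIPTION, which 
matches the symptom described in this report.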



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
