Wrong text extract from vertical textboxes in pdf files
-------------------------------------------------------

                 Key: TIKA-494
                 URL: https://issues.apache.org/jira/browse/TIKA-494
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.7
         Environment: Win 7, VS 2010 C#
            Reporter: Sandor Dj
            Priority: Critical


Vertical text boxes in PDF files are not extracted correctly (using the Tika 
library from C#).
For example, if a PDF file contains a vertical text box with the text "hello" 
(WITHOUT line breaks):

H
E
L
L
O

the parser returns 5 strings, each containing a single letter, even though 
there is NO line break after any letter.
Is there an option to avoid this problem?
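
Until such an option exists, one possible workaround is to post-process the
extracted text and join runs of consecutive one-character lines. The snippet
below is only a minimal sketch of that idea, not part of Tika: the function
name merge_single_char_lines is hypothetical, it assumes the extracted text
arrives as one string with a newline after each letter, and it is written in
Python purely for illustration (the same logic should port to C#).

```python
def merge_single_char_lines(text: str) -> str:
    """Join runs of two or more consecutive one-character lines.

    Hypothetical post-processing helper; assumes the vertical text box was
    extracted as one letter per line.
    """
    merged = []  # output lines
    run = []     # pending one-character lines

    def flush():
        # A run of two or more single characters is treated as one word;
        # a lone single-character line is emitted as a line of its own.
        if len(run) > 1:
            merged.append("".join(run))
        else:
            merged.extend(run)
        run.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if len(stripped) == 1:
            run.append(stripped)
        else:
            flush()
            merged.append(line)
    flush()
    return "\n".join(merged)


if __name__ == "__main__":
    print(merge_single_char_lines("H\nE\nL\nL\nO"))
```

Note that this heuristic would also merge lines that really are separate
one-character lines (for example, a column of single-letter labels), so it
cannot replace correct handling of vertical text in the parser itself.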


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.