Incorrect text extraction when text rotation is not 0,90,180,270
----------------------------------------------------------------

                 Key: PDFBOX-878
                 URL: https://issues.apache.org/jira/browse/PDFBOX-878
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.3.1, 1.4.0
            Reporter: David Rodríguez Alfayate


Currently, text extraction only supports rotations of 0, 90, 180 or 270 degrees,
so text rotated by any other angle is extracted incorrectly. I have attached a
simple PDF and the text extraction result produced by ExtractText.

I have made a patch for the current revision (1.4.0) that takes into account any 
rotation present in the current position matrix. I also had to refactor the 
assumption that (0,0) is the upper-left corner, because for rotations that fall 
outside that original assumption a word or a line could be split.
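
To illustrate the idea (this is a minimal sketch, not the code from the patch), 
the rotation angle can be read from the first two entries of a matrix 
[a b c d e f], and a position can then be rotated back so that glyphs on the 
same rotated baseline share one coordinate. The class and method names below 
are hypothetical and are not part of the PDFBox API; the sketch also assumes 
the matrix holds a pure rotation, with no skew.

```java
/** Hypothetical helper for illustration only; not part of PDFBox. */
final class RotationSketch {
    private RotationSketch() {}

    /**
     * Rotation angle in degrees, normalized to [0, 360), for a matrix
     * [a b c d e f] where a = cos(theta) and b = sin(theta) up to scale.
     */
    static double angleDegrees(double a, double b) {
        double deg = Math.toDegrees(Math.atan2(b, a));
        return deg < 0 ? deg + 360.0 : deg;
    }

    /**
     * Rotates the point (x, y) by -angleDeg, so that the text direction
     * becomes the x axis. The returned array is {alongBaseline, acrossBaseline}.
     */
    static double[] toBaselineFrame(double x, double y, double angleDeg) {
        double t = Math.toRadians(angleDeg);
        double cos = Math.cos(t);
        double sin = Math.sin(t);
        return new double[] { x * cos + y * sin, -x * sin + y * cos };
    }
}
```

With this, two glyphs on a line rotated by 30 degrees get the same second 
coordinate, so they can be grouped into one line without assuming that the 
angle is a multiple of 90.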

Since we need this for a project we are working on, I have also changed the way 
normalization and line printing are done. In the current codebase the normalize 
function returns a List of Strings; with my changes it returns a List of Words, 
which are ICU-normalized and then printed by the current writeLine method. My 
patch also includes a sample PDF2XML class, which converts a PDF to XML and 
handles each word separately.
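
As a rough sketch of what returning words instead of strings can look like 
(again, not the code from the patch): the Word class and the normalize 
signature below are hypothetical, and java.text.Normalizer with NFKC is used 
here only as a stand-in for ICU normalization so that the example has no 
external dependencies.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; the real patch may be structured differently. */
final class WordSketch {
    private WordSketch() {}

    /** A single normalized word; a real implementation would also carry its position. */
    static final class Word {
        final String text;

        Word(String text) {
            this.text = text;
        }
    }

    /** Splits a line on whitespace and normalizes each token (NFKC as an ICU stand-in). */
    static List<Word> normalize(String line) {
        List<Word> words = new ArrayList<Word>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(new Word(Normalizer.normalize(token, Normalizer.Form.NFKC)));
            }
        }
        return words;
    }
}
```

Keeping each word as its own object is what lets a consumer such as an XML 
writer emit one element per word rather than one per line.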

I submit the test-cases and the patch for your consideration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
