text areas not properly being sorted because of page rotation
-------------------------------------------------------------

                 Key: PDFBOX-374
                 URL: https://issues.apache.org/jira/browse/PDFBOX-374
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.8.0-incubator
            Reporter: Brian Carrier


When PDFTextStripper is set to sort the text before outputting it, the sorting 
is incorrect if the page has a rotation.  The reason is that both 
TextPositionComparator and PDFStreamEngine take the rotation into account, so 
the rotation has been applied twice by the time the comparison is done in 
TextPositionComparator. 
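
To illustrate the double adjustment, here is a minimal sketch.  It is not the 
PDFBox code: Pos, rotate90, PLAIN and ROTATING are hypothetical names, and the 
90 degree adjustment is assumed to be a simple swap of X and Y.  The positions 
are adjusted once before sorting, and the ROTATING comparator then adjusts them 
a second time, the way the report describes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: Pos and rotate90 are hypothetical, not PDFBox classes or methods.
final class DoubleRotationSketch {
    static final class Pos {
        final String text;
        final float x;
        final float y;

        Pos(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed adjustment for a 90 degree page: swap the axes.
    static Pos rotate90(Pos p) {
        return new Pos(p.text, p.y, p.x);
    }

    // Top to bottom, then left to right, trusting the coordinates as given.
    static final Comparator<Pos> PLAIN =
            Comparator.comparingDouble((Pos p) -> p.y).thenComparingDouble(p -> p.x);

    // Rotates again before comparing, which mimics the reported double adjustment.
    static final Comparator<Pos> ROTATING =
            (a, b) -> PLAIN.compare(rotate90(a), rotate90(b));

    static List<String> sortedText(List<Pos> raw, Comparator<Pos> cmp) {
        List<Pos> adjusted = new ArrayList<>();
        for (Pos p : raw) {
            adjusted.add(rotate90(p)); // first adjustment, done before sorting
        }
        adjusted.sort(cmp);
        List<String> out = new ArrayList<>();
        for (Pos p : adjusted) {
            out.add(p.text);
        }
        return out;
    }
}
```

With raw positions A (10,10), B (10,50) and C (20,5), sortedText with PLAIN 
returns A, B, C, while ROTATING returns C, A, B, because the second swap undoes 
the first and the comparison runs on the unadjusted coordinates.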

Also, it seems that the rotation code in PDFStreamEngine is not consistent. I 
verified that the code for 0 and 90 degrees works, but the 180 and 270 cases do 
not seem consistent with the goal of adjusting the X and Y values so that 0,0 
is in the upper left, which is what the 0 and 90 code does.  I do not have 
examples of 180 and 270 to test with. There are no comments in this section, so 
I have been guessing about its purpose.
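
As a sketch of what a consistent adjustment could look like for all four 
rotations (this is not the patched PDFStreamEngine code): it assumes the input 
is in unrotated page space with the origin in the lower left and Y pointing up, 
that the rotation is clockwise as the PDF /Rotate entry defines it, and that 
the output should have 0,0 in the upper left of the displayed page with Y 
pointing down.  The class and method names are hypothetical.

```java
// Illustrative only: RotationSketch and toUpperLeft are hypothetical names.
final class RotationSketch {
    /**
     * Maps a point from unrotated page space (origin lower left, Y up) to
     * displayed space (origin upper left, Y down) for a clockwise rotation.
     * Returns {x, y} in displayed space.
     */
    static float[] toUpperLeft(float x, float y,
                               float pageWidth, float pageHeight, int rotation) {
        switch (rotation) {
            case 0:
                // Only the Y axis is flipped.
                return new float[] {x, pageHeight - y};
            case 90:
                // The lower-left corner of the page becomes the upper left.
                return new float[] {y, x};
            case 180:
                // The lower-right corner of the page becomes the upper left.
                return new float[] {pageWidth - x, y};
            case 270:
                // The upper-right corner of the page becomes the upper left.
                return new float[] {pageHeight - y, pageWidth - x};
            default:
                throw new IllegalArgumentException(
                        "rotation must be 0, 90, 180 or 270: " + rotation);
        }
    }
}
```

One way to check such a mapping without sample documents is to confirm that, 
for each rotation, the page corner that ends up in the upper left of the 
displayed page maps to 0,0.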

The attached patches:
- Remove the rotation from TextPositionComparator
- Add comments and change the 180 and 270 cases to make them consistent with 
0 and 90. 


