[ https://issues.apache.org/jira/browse/PDFBOX-374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648342#action_12648342 ]
Brian Carrier commented on PDFBOX-374:
--------------------------------------
After reviewing the patches in PDFBOX-363 and finding more examples that the
previous patch in this entry did not fix, I have attached a new patch. Note:
the landscape_rot90.pdf file that was later attached to PDFBOX-363 is one such
example. The previous patch does not handle it, but this patch does.
This patch moves all knowledge about page rotation and the direction of text to
the TextPosition class. The code now relies on the text matrix instead of the
page rotation value. New APIs were added so that callers can get
text-direction-adjusted coordinates. The behavior of the original APIs is
preserved for other parts of PDFBox. Other code was adjusted accordingly. I
also did some cleanup in PDFStreamEngine and PDFTextStripper to remove unused
variables and rename some variables to make their contents easier to understand.
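
As a rough illustration of relying on the text matrix rather than the page
rotation value, the sketch below derives a text direction from the a and b
entries of a matrix [a b c d e f] and maps a position into direction-adjusted
coordinates. This is not the code in the attached patch. The class and method
names are hypothetical, and the sketch assumes a lower-left page origin and
text directions that are multiples of 90 degrees.

```java
final class TextDirectionSketch {

    /**
     * Text direction in degrees, snapped to 0, 90, 180 or 270, computed from
     * the a and b entries of a text matrix [a b c d e f].
     */
    static int directionDegrees(double a, double b) {
        double degrees = Math.toDegrees(Math.atan2(b, a));
        int snapped = (int) Math.round(degrees / 90.0) * 90;
        return ((snapped % 360) + 360) % 360; // e.g. -90 becomes 270
    }

    /** Distance travelled along the reading direction. */
    static double xDirAdj(double x, double y, double pageWidth,
                          double pageHeight, int direction) {
        switch (direction) {
            case 0:   return x;
            case 90:  return y;
            case 180: return pageWidth - x;
            case 270: return pageHeight - y;
            default:  throw new IllegalArgumentException("direction: " + direction);
        }
    }

    /** Distance from the "top" edge as seen by a reader of the rotated text. */
    static double yDirAdj(double x, double y, double pageWidth,
                          double pageHeight, int direction) {
        switch (direction) {
            case 0:   return pageHeight - y;
            case 90:  return x;
            case 180: return y;
            case 270: return pageWidth - x;
            default:  throw new IllegalArgumentException("direction: " + direction);
        }
    }

    public static void main(String[] args) {
        // Hypothetical glyph at (10, 20) on a 600 x 800 page, matrix a=0, b=1.
        int dir = directionDegrees(0, 1);
        System.out.println(dir + " degrees: "
                + xDirAdj(10, 20, 600, 800, dir) + ", "
                + yDirAdj(10, 20, 600, 800, dir));
    }
}
```

With coordinates adjusted this way, a caller can sort by the adjusted Y and
then the adjusted X without knowing how the page or the text is rotated.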
Some regression tests now fail, but in most cases the new output is better:
- The two mismatches in "hexnumberproblem.pdf" are because the new code
produces better output.
- The mismatches in ocalc.pdf are all because the new code produces better
output.
- The mismatches in test_rotate_270.pdf are because the new code puts "t" on its
own line, which causes every line after it to fail. The previous version of the
code produced better results in this case, but it is not clear how. The text is
on an angle relative to the other text and the height of "t" is such that it is
equivalent to being on another line of text. I tried to adjust the code so that
it was more liberal with making new lines, but it caused lots of other failures
in the regression tests.
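
To illustrate why a single character can end up on its own line, here is a
minimal sketch of one possible same-line test based on vertical overlap. It is
an assumption for illustration only, not the logic in PDFTextStripper or in
the attached patch. The names are hypothetical, Y is measured downward from
the top of the page, and each glyph is taken to occupy the range from
(baseline - height) to baseline.

```java
final class LineBreakSketch {

    /**
     * Returns true when the vertical extents of two glyphs overlap.
     * Each glyph occupies [baselineY - height, baselineY], with Y growing
     * downward from the top of the page.
     */
    static boolean sameLine(double baselineY1, double height1,
                            double baselineY2, double height2) {
        double top1 = baselineY1 - height1;
        double top2 = baselineY2 - height2;
        return top1 < baselineY2 && top2 < baselineY1;
    }

    public static void main(String[] args) {
        // Slightly offset baselines still overlap.
        System.out.println(sameLine(100, 10, 102, 10));
        // A glyph whose extent lies entirely below the other does not.
        System.out.println(sameLine(100, 10, 115, 10));
    }
}
```

Under a test like this, a glyph on angled text whose extent does not overlap
its neighbors is treated as starting a new line, and loosening the test
changes the line breaks for every other document as well.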
Note that the regression tests do not currently sort the text based on
location, so the page rotation issues are not tested. New regression tests
must be created.
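
A position-sorted regression test could take roughly the following shape. This
is only a sketch under stated assumptions: Glyph, READING_ORDER and sortedText
are hypothetical names, not PDFBox APIs, and the coordinates are assumed to be
already adjusted for text direction so that Y grows downward and X grows along
the reading direction.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortedTextSketch {

    /** Hypothetical stand-in for a piece of positioned text. */
    static final class Glyph {
        final String text;
        final double xDirAdj;
        final double yDirAdj;

        Glyph(String text, double xDirAdj, double yDirAdj) {
            this.text = text;
            this.xDirAdj = xDirAdj;
            this.yDirAdj = yDirAdj;
        }
    }

    /** Top-to-bottom first, then left-to-right within the same Y. */
    static final Comparator<Glyph> READING_ORDER =
            Comparator.comparingDouble((Glyph g) -> g.yDirAdj)
                      .thenComparingDouble(g -> g.xDirAdj);

    /** Concatenates the glyph text in reading order without mutating the input. */
    static String sortedText(List<Glyph> glyphs) {
        List<Glyph> copy = new ArrayList<>(glyphs);
        copy.sort(READING_ORDER);
        StringBuilder out = new StringBuilder();
        for (Glyph g : copy) {
            out.append(g.text);
        }
        return out.toString();
    }
}
```

A regression test built this way would feed in positions taken from a rotated
page and compare the sorted text against a stored expected string.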
> text areas not properly being sorted because of page rotation
> -------------------------------------------------------------
>
> Key: PDFBOX-374
> URL: https://issues.apache.org/jira/browse/PDFBOX-374
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.8.0-incubator
> Reporter: Brian Carrier
> Attachments: rotation.pdf, text-rotation-081117.zip
>
>
> When PDFTextStripper is set to sort the text before outputting, the sorting
> is not correct if a page rotation exists. The reason is that both
> TextPositionComparator and PDFStreamEngine take the rotation into account,
> so the rotation has been applied twice by the time the comparison is done in
> TextPositionComparator.
> Also, the rotation code in PDFStreamEngine does not seem to be consistent. I
> verified that the code for 0 and 90 degrees works, but the 180 and 270 cases
> do not seem consistent with the goal of adjusting the X and Y values so that
> 0,0 is in the upper left, which is what the 0 and 90 code does. I do not
> have examples of 180 and 270 to test with. There are no comments in this
> section, so I have been guessing about its purpose.
> The attached patches:
> - Remove the rotation from TextPositionComparator.
> - Add comments and change the 180 and 270 cases to make them consistent
> with 0 and 90.