I'm trying to find a solution to the text rotation problems that
satisfies the current regression tests and that works on some files
in which I found bugs. I've hit a point, though, where I need insight
from people who know more about the graphics and non-text parts of
PDFBox.
Background:
- Text in PDF files is stored in chunks of one or more characters.
The "text matrix" can define whether the text goes to the right,
left, up, or down.
- PDFStreamEngine.showString() decodes the text stored in the PDF
file and saves each chunk in a TextPosition object. The TextPosition
object has an X,Y coordinate for the text chunk.
- PDFTextStripper.flushText() prints the raw text (which requires
sorting the text chunks into the correct order and determining how
far apart they are so that extra spaces are added if needed).
As an example of this, I have a page that is a normal landscape
document. Internally, the text starts at the lower left, the text
direction is "up", and the page is rotated 90 degrees.
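To make the sorting and spacing step concrete, here is a rough sketch
of the idea. It is not the PDFTextStripper code: Chunk, ChunkFlusher
and the spaceGap threshold are names I made up for illustration, the
coordinates are assumed to be already in presentation order (origin
at the upper left, y growing downward), and chunks on the same line
are assumed to share exactly the same y.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a text chunk; NOT PDFBox's TextPosition. */
final class Chunk {
    final String text;
    final float x;     // left edge, presentation coordinates
    final float y;     // line position, grows downward (assumption)
    final float width; // advance width of the chunk

    Chunk(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

final class ChunkFlusher {
    /**
     * Sorts chunks top-to-bottom, then left-to-right, and joins them.
     * A newline is emitted when y changes; a space is emitted when the
     * horizontal gap to the previous chunk exceeds spaceGap.
     */
    static String flush(List<Chunk> chunks, float spaceGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.<Chunk>comparingDouble(c -> c.y)
                .thenComparingDouble(c -> c.x));

        StringBuilder out = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : sorted) {
            if (prev != null) {
                if (c.y != prev.y) {
                    out.append('\n');
                } else if (c.x - (prev.x + prev.width) > spaceGap) {
                    out.append(' ');
                }
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }
}
```

The point is that both the sort order and the gap test silently assume
a particular "up" and "left", which is where the rotation comes in.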
Problem:
- The step of sorting and outputting text requires knowledge about
the page rotation because PDFTextStripper needs to know where the
"upper left corner" of the page is and how to sort (via
TextPositionComparator). For example, in the previous example the
"upper left" presentation corner is really the lower left corner in
coordinate space.
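As an illustration of how the sort depends on rotation, here is a
sketch of a comparator that works directly on native PDF coordinates
(origin at the lower left, y growing upward). NativePoint and
ReadingOrder are hypothetical names, not PDFBox classes, and this is
not how TextPositionComparator is written. It assumes the page
rotation is a clockwise multiple of 90 degrees and ignores the text
direction itself.

```java
import java.util.Comparator;

/** Hypothetical point in native PDF coordinates; not a PDFBox class. */
final class NativePoint {
    final float x;
    final float y;

    NativePoint(float x, float y) {
        this.x = x;
        this.y = y;
    }
}

final class ReadingOrder {
    /**
     * Returns a comparator that orders points top-to-bottom, then
     * left-to-right, as seen on the page after it has been rotated
     * clockwise by the given number of degrees.
     */
    static Comparator<NativePoint> forRotation(int rotation) {
        int r = ((rotation % 360) + 360) % 360;
        switch (r) {
            case 0:   // top = large y, left = small x
                return Comparator.<NativePoint>comparingDouble(p -> -p.y)
                        .thenComparingDouble(p -> p.x);
            case 90:  // top = small x, left = small y
                return Comparator.<NativePoint>comparingDouble(p -> p.x)
                        .thenComparingDouble(p -> p.y);
            case 180: // top = small y, left = large x
                return Comparator.<NativePoint>comparingDouble(p -> p.y)
                        .thenComparingDouble(p -> -p.x);
            case 270: // top = large x, left = large y
                return Comparator.<NativePoint>comparingDouble(p -> -p.x)
                        .thenComparingDouble(p -> -p.y);
            default:
                throw new IllegalArgumentException(
                        "rotation must be a multiple of 90: " + rotation);
        }
    }
}
```

For the landscape page above (rotation 90), the native lower left
corner sorts first, which matches it being the "upper left" as
presented.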
There seem to be (at least) three ways to solve this:
1) Store the coordinates in TextPosition in coordinates that are
adjusted for the page rotation (this is the original way of doing it).
2) Store the coordinates in TextPosition in the native PDF
coordinates and then each user of TextPosition can adjust for the
rotation (this is what one of the patches does).
3) Store the native coordinates in TextPosition along with the page
rotation value and provide an alternate API that gives the adjusted
coordinates.
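To make option 3 concrete, here is a minimal sketch of what such an
object could look like. RotationAwarePosition, getAdjustedX() and
getAdjustedY() are invented names, not existing PDFBox API. The sketch
assumes the rotation is a clockwise multiple of 90 degrees, that the
page box starts at (0,0), and that "adjusted" means measured from the
upper left corner of the page as presented, with y growing downward.

```java
/** Hypothetical sketch of option 3; NOT the real TextPosition class. */
final class RotationAwarePosition {
    private final float x;          // native PDF x (origin lower left)
    private final float y;          // native PDF y (grows upward)
    private final float pageWidth;  // unrotated page width
    private final float pageHeight; // unrotated page height
    private final int rotation;     // normalized to 0, 90, 180 or 270

    RotationAwarePosition(float x, float y, float pageWidth,
                          float pageHeight, int rotation) {
        int r = ((rotation % 360) + 360) % 360;
        if (r % 90 != 0) {
            throw new IllegalArgumentException(
                    "rotation must be a multiple of 90: " + rotation);
        }
        this.x = x;
        this.y = y;
        this.pageWidth = pageWidth;
        this.pageHeight = pageHeight;
        this.rotation = r;
    }

    /** Native coordinates, exactly as found in the content stream. */
    float getX() { return x; }
    float getY() { return y; }

    /** Distance from the left edge of the page as presented. */
    float getAdjustedX() {
        switch (rotation) {
            case 90:  return y;
            case 180: return pageWidth - x;
            case 270: return pageHeight - y;
            default:  return x;
        }
    }

    /** Distance from the top edge of the page as presented. */
    float getAdjustedY() {
        switch (rotation) {
            case 90:  return x;
            case 180: return y;
            case 270: return pageWidth - x;
            default:  return pageHeight - y;
        }
    }
}
```

With this shape, text extraction could sort on the adjusted values
while any caller that wants native coordinates keeps using getX() and
getY() unchanged.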
The only other place where TextPosition.getX() is called is in
PageDrawer.showCharacter(), in a call to PDFont.drawString(). I can't
find many references to page rotation in PDFBox, so I don't know how
the graphics code takes rotation into account, and I can't tell
whether drawString() assumes a rotation-adjusted coordinate or not.
Is there an overall design approach in PDFBox with respect to when
the rotation should be taken into account? Is any of the proposed
solutions above most in line with the rest of the PDFBox code?
thanks,
brian