I'm trying to find a solution to the text rotation problems that satisfies the current regression tests and also works on some files in which I found bugs. I've hit a point, though, where I need insight from people who know more about the graphics and non-text parts of PDFBox.

Background:
- Text in PDF files is stored in chunks of one or more characters. The "text matrix" can define whether the text goes to the right, left, up, or down.
- PDFStreamEngine.showString() takes the text stored in the PDF file, decodes it, and saves each chunk in a TextPosition object. The TextPosition object has an X,Y coordinate for the text chunk.
- PDFTextStripper.flushText() prints the raw text, which requires sorting the text chunks into the correct order and determining how far apart they are so that extra spaces are added if needed.

As an example of this, I have a page that is a normal landscape document. Internally, the text starts at the lower left, the text direction is "up", and the page is rotated 90 degrees.

Problem:
- The step of sorting and outputting text requires knowledge of the page rotation, because PDFTextStripper needs to know where the "upper left corner" of the page is and how to sort (via TextPositionComparator). In the landscape example above, the "upper left" presentation corner is really the lower left corner in coordinate space.
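To make the sorting issue concrete, here is a rough sketch of a rotation-aware sort order. It is not the existing TextPositionComparator; Chunk and ReadingOrder are hypothetical names made up for illustration, the coordinates are assumed to be native PDF coordinates (origin at the lower left, y increasing upward), and the rotation is assumed to be the clockwise page rotation in degrees. A real comparator would also need a tolerance for chunks that are only roughly on the same line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a text chunk; not the PDFBox TextPosition class. */
final class Chunk {
    final String text;
    final float x; // native PDF x, measured from the left edge of the unrotated page
    final float y; // native PDF y, measured from the bottom edge of the unrotated page

    Chunk(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

/** Hypothetical helper: top-to-bottom, then left-to-right order on the displayed page. */
final class ReadingOrder {
    static Comparator<Chunk> forRotation(int rotation) {
        switch (rotation) {
            case 90:
                // Displayed top is the native left edge, displayed left is the native bottom edge.
                return Comparator.<Chunk>comparingDouble(c -> c.x).thenComparingDouble(c -> c.y);
            case 180:
                // Displayed top is the native bottom edge, displayed left is the native right edge.
                return Comparator.<Chunk>comparingDouble(c -> c.y).thenComparingDouble(c -> -c.x);
            case 270:
                // Displayed top is the native right edge, displayed left is the native top edge.
                return Comparator.<Chunk>comparingDouble(c -> -c.x).thenComparingDouble(c -> -c.y);
            default:
                // Unrotated page: higher y comes first, then smaller x.
                return Comparator.<Chunk>comparingDouble(c -> -c.y).thenComparingDouble(c -> c.x);
        }
    }
}
```

With a rotation of 90, two chunks that share the same native x are on the same displayed line, and the one with the smaller native y is read first. Sorting the same chunks with a rotation of 0 gives a different order, which is why the sort cannot ignore the page rotation.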

There seem to be (at least) three ways to solve this:
1) Store the coordinates in TextPosition in coordinates that are adjusted for the page rotation (this is the original way of doing it).
2) Store the coordinates in TextPosition in the native PDF coordinates and then each user of TextPosition can adjust for the rotation (this is what one of the patches does).
3) Store the native coordinates in TextPosition along with the page rotation value and provide an alternate API that gives the adjusted coordinates.
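As a rough illustration of option 3, something along these lines is what I have in mind. This is only a sketch, not the current TextPosition: the class name RotationAwarePosition and the methods getAdjustedX()/getAdjustedY() are hypothetical, and it assumes the rotation is the clockwise page rotation (0, 90, 180, or 270) and that the unrotated page width and height are available when the object is created.

```java
/** Hypothetical sketch of option 3; not the PDFBox TextPosition class. */
final class RotationAwarePosition {
    private final float x;          // native PDF x (origin at lower left of the unrotated page)
    private final float y;          // native PDF y
    private final int rotation;     // assumed clockwise page rotation: 0, 90, 180, or 270
    private final float pageWidth;  // width of the unrotated page
    private final float pageHeight; // height of the unrotated page

    RotationAwarePosition(float x, float y, int rotation, float pageWidth, float pageHeight) {
        this.x = x;
        this.y = y;
        this.rotation = rotation;
        this.pageWidth = pageWidth;
        this.pageHeight = pageHeight;
    }

    /** Native coordinates, for callers that do their own rotation handling. */
    float getX() { return x; }
    float getY() { return y; }

    /** Distance from the left edge of the page as displayed. */
    float getAdjustedX() {
        switch (rotation) {
            case 90:  return y;
            case 180: return pageWidth - x;
            case 270: return pageHeight - y;
            default:  return x;
        }
    }

    /** Distance down from the top edge of the page as displayed. */
    float getAdjustedY() {
        switch (rotation) {
            case 90:  return x;
            case 180: return y;
            case 270: return pageWidth - x;
            default:  return pageHeight - y;
        }
    }
}
```

The idea is that code which wants native coordinates keeps calling getX()/getY(), while the sorting and spacing code calls the adjusted getters and always sees "upper left is the origin, x goes right, y goes down" regardless of rotation.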

The only other place where TextPosition.getX() is called is PageDrawer.showCharacter(), in a call to PDFont.drawString(). I can't find many references to page rotation in PDFBox, I don't know how the graphics code takes rotation into account, and I can't figure out whether drawString() assumes rotation-adjusted coordinates or not.

Is there an overall design approach in PDFBox with respect to when the rotation should be taken into account? Which of the above proposed solutions, if any, is most in line with the rest of the PDFBox code?

thanks,
brian

