Hi all, I'm trying to scrape highlighted text from a large number of PDFs using PDFBox (from Jython). I've come close, but have hit a bit of a wall.
I can access the annotations (PDAnnotationTextMarkup) and get their bounding boxes in user-space coordinates through the getRectangle() method. I then use PDFTextStripperByArea to mark that region and fetch the text from it. However, the text I get back comes from a different part of the page, which suggests to me that the two coordinate systems differ. The PrintTextLocations example in org.pdfbox.examples.util prints every piece of text with its location, and those locations agree with what I get when scraping text with PDFTextStripperByArea, i.e. the two use the same coordinate system. Can anyone suggest the appropriate transform between these systems?
Thanks, Lars
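For reference, here is a minimal sketch of the kind of transform I suspect is needed. It assumes the annotation rectangle is in default PDF user space (origin at the bottom-left of the page, y increasing upward), while the text stripper works with the origin at the top-left and y increasing downward. It also assumes an unrotated page whose page box starts at (0, 0). The function name and parameters are hypothetical, chosen just for illustration; the page height would have to come from the page's media box (792 points for US Letter).

```python
def annotation_rect_to_region(llx, lly, urx, ury, page_height):
    """Map a user-space rectangle (bottom-left origin, y up) to an
    (x, y, width, height) region with a top-left origin and y down.

    Assumes an unrotated page whose box origin is (0, 0).
    """
    x = llx                      # x is unchanged by a vertical flip
    y = page_height - ury        # top edge, measured from the top of the page
    width = urx - llx
    height = ury - lly
    return (x, y, width, height)


# Example: a highlight near the top of a 792-point-high page.
region = annotation_rect_to_region(100, 700, 200, 720, 792)
print(region)
```

In Jython the resulting tuple could presumably be unpacked into a java.awt.geom.Rectangle2D.Float and handed to the stripper's addRegion call, but I have not verified that this accounts for rotated pages or a shifted crop box.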