A few months ago I was trying to extract formatted text from a PDF and output it in a structured format (ideally XML/HTML). The text attributes I needed for each line of text were:
- Paragraph (i.e. relative location on page)
- Font
- Font size
- Font weight

I tried to do this with PDFBox at the time but was unable to. I posted to the mailing list and was told this functionality was not available yet, and I would have to implement it myself. I didn't have the time (and possibly the ability) to do this, so I went with a commercial tool.

Has PDFBox now moved on enough for it to be able to do the above out of the box (no pun intended!)?

Thanks.