A few months ago I was trying to extract formatted text from a PDF
and output it in a structured format (ideally XML/HTML). The
attributes I needed for each line of text were:

- Paragraph (i.e. relative location on page)
- Font
- Font size
- Font weight

I tried to do this with PDFBox at the time but was unable to. I posted
to the mailing list and was told this functionality was not available
yet, and I would have to implement it myself. I didn't have the time
(and possibly the ability) to do this, so I went with a commercial
tool.

Has PDFBox since progressed enough to do the above out of the box
(no pun intended!)?

Thanks.
