Antonio Fiol Bonnin wrote:
> Thank you, Con, for your very interesting point of view. We were
> working on (a) but I have told my team that we will be changing
> approach in one hour if they do not see a clear end.
>
> Other than that, I will look into pdftohtml (is it really html?).
http://pdftohtml.sourceforge.net/

It can produce HTML or XML. The XML is closer in form to the content of the PDF: it has pages containing text with typographic and positional formatting. The HTML has some of the formatting information removed (I think), and some kind of guesswork is used to stick lines of text back into paragraphs.
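
To give an idea of what working with the XML form might look like, here is a rough Python sketch. The sample document is hand-written, not real tool output: the element and attribute names (`pdf2xml`, `page`, `text`, `top`, `left`) are my assumption about what `pdftohtml -xml` emits and may differ between versions, so check them against an actual file first. The `page_lines` helper is hypothetical. It only collects the text fragments of each page and orders them by position; it makes no attempt to rebuild paragraphs.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating `pdftohtml -xml` output. The element and
# attribute names are an assumption and may vary between versions.
SAMPLE = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <fontspec id="0" size="12" family="Times" color="#000000"/>
    <text top="140" left="90" width="200" height="15" font="0">second line</text>
    <text top="120" left="90" width="210" height="15" font="0">first <b>line</b></text>
  </page>
</pdf2xml>"""


def page_lines(xml_text):
    """Return {page_number: [text, ...]} with fragments sorted by position."""
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter("page"):
        fragments = []
        for text in page.iter("text"):
            # itertext() also picks up text nested inside <b>/<i> children.
            content = "".join(text.itertext())
            fragments.append((int(text.get("top")), int(text.get("left")), content))
        # Tuples sort by top coordinate first, then by left coordinate.
        fragments.sort()
        pages[int(page.get("number"))] = [content for _, _, content in fragments]
    return pages


if __name__ == "__main__":
    print(page_lines(SAMPLE))
```

Sorting by `top` and then `left` is the simplest possible reading order. Multi-column pages or text with slightly uneven baselines would need something smarter, which is presumably the kind of guesswork the HTML mode does for you.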
