Hi,

I have compared PDFBox's text extraction with pdftohtml (on Linux) followed by an HTML-to-text conversion, and I found the output of the second route a little clearer. For example, the lines at the bottom of a PDF (copyright notice, etc.) were merged into a single line by the PDFBox conversion, whereas the other route kept them as three separate pieces.
I am using the latest stable PDFBox jar, which dates back to 2006, and the pdftohtml utility is from about the same time, so I can understand this. My question is twofold: does the comparison make sense, and should I use pdftohtml combined with a text converter, or should I try to build the latest PDFBox from SVN?

Thank you,
Mark