Hi,
I have compared PDFBox's PDF-to-text conversion with a two-step route:
pdftohtml (on Linux) followed by an HTML-to-text conversion. I found the
output of the second route a little clearer. For example, the lines at the
bottom of a PDF page (copyright notices, etc.) were merged into a single
line by PDFBox, but came out as three separate pieces via pdftohtml.

I am using the latest stable PDFBox jar, which dates back to 2006, and the
pdftohtml utility is from about the same time, so I can understand this.

My question then is twofold: does the comparison make sense, and should I
use pdftohtml combined with a text converter, or should I try to build the
latest PDFBox from SVN?

Thank you,
Mark
