All,
Apologies for the delay. I finally finished the comparison of text extracted
from 100k PDFs with 1.8.10 and 2.0 trunk (pdfbox-2.0.0-20151022.051152-1783).
The reports are available here [0]. I botched the commit message...
I haven't had a chance to review the results. The eval code is still in
development and there might be bugs! To view the docs, prepend: h t t p : slash
slash one six two . two four two . two two eight . one seven four/docs/ ...
just don't let any of the scrapers read that. ;) The docs include every PDF in
our corpus that had an RTL word (when extracted with 1.8.10 :)); I then took a
random selection from Common Crawl and govdocs1 to fill out the set to ~100k PDFs.
Let me know if you have any questions.
Cheers,
Tim
[0] https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip