All,

  Apologies for the delay.  I finally finished the comparison of text extracted
from 100k PDFs with PDFBox 1.8.10 and the 2.0 trunk (pdfbox-2.0.0-20151022.051152-1783).
The reports are available here [0].  I botched the commit message...

  I haven't had a chance to review the results.  The eval code is still in 
development and there might be bugs! To view the docs, prepend: h t t p : slash 
slash one six two . two four two . two two eight . one seven four/docs/  ... 
just don't let any of the scrapers read that. ;)  The docs include every file
in our corpus that had an RTL word (when extracted with 1.8.10 :)), and then
I took a random selection to fill out ~100k PDFs from Common Crawl and govdocs1.

  Let me know if you have any questions.

          Cheers,

                     Tim


[0] https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip
