Hi Tim,

I've created https://issues.apache.org/jira/browse/PDFBOX-3058 to track our part of fixing the issues found by this test (and later ones to come), and added you and Tilman as watchers.
BR
Maruan

> On 23.10.2015 at 21:36, Allison, Timothy B. <[email protected]> wrote:
>
> All,
>
> Apologies for the delay. I finally finished the comparison of text
> extracted from 100k pdfs with 1.8.10 and 2.0 trunk
> (pdfbox-2.0.0-20151022.051152-1783).
> The reports are available here [0]. I botched the commit message...
>
> I haven't had a chance to review the results. The eval code is still in
> development and there might be bugs! To view the docs, prepend: h t t p :
> slash slash one six two . two four two . two two eight . one seven four/docs/
> ... just don't let any of the scrapers read that. ;) The docs include all
> those within our corpus that had a rtl word (when extracted with 1.8.10 :))
> and then I took a random selection to fill out ~100k pdfs from common crawl
> and govdocs1.
>
> Let me know if you have any questions.
>
> Cheers,
>
> Tim
>
>
> [0]
> https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip
