Hi Tim,

Thanks for the report. I fixed five bugs today.

1)
re file commoncrawl2/NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU:

Please check your code to see why the word "microsoft" is reported as missing. It is in the /Title:
18 0 obj
(Microsoft Word - Water Line Pipe Sizing.docx.doc)
endobj
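
As a quick sanity check, here is a minimal sketch of a case-insensitive tokenizer applied to that title string. The `tokens` helper and its regex are assumptions made for illustration only; they are not taken from the eval code, which may tokenize differently.

```python
import re

def tokens(text):
    """Lowercase the text and split it into runs of letters and digits.

    Hypothetical helper for illustration; not the eval code's tokenizer.
    """
    return re.findall(r"[^\W_]+", text.lower())

# The /Title string from object 18 of the file above.
title = "Microsoft Word - Water Line Pipe Sizing.docx.doc"
title_tokens = tokens(title)

print(title_tokens)
print("microsoft" in title_tokens)
```

With this kind of tokenization, "microsoft" is among the title's tokens, so if the report counts it as missing, the cause may be elsewhere (for example in case handling or in which fields are compared).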


2)
Could you please rerun the test with the latest trunk, preferably with the same test set? One of the bugs I fixed (PDFBOX-3053) affects many files. This leaves me with a problem: the "problem" files I test manually no longer miss the tokens mentioned in the report.
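
For checking single files by hand, a rough sketch of a "missing tokens" comparison between two extractions could look like this. The function names and the sample strings are made up for illustration, and the actual eval code may well compute its report differently.

```python
from collections import Counter
import re

def token_counts(text):
    """Count lowercase runs of letters and digits (hypothetical tokenizer)."""
    return Counter(re.findall(r"[^\W_]+", text.lower()))

def missing_tokens(reference_text, candidate_text):
    """Tokens that occur more often in reference_text than in candidate_text."""
    # Counter subtraction keeps only tokens with a positive remaining count.
    return token_counts(reference_text) - token_counts(candidate_text)

# Made-up sample texts standing in for the 1.8.10 and trunk extractions.
old_text = "Water Line Pipe Sizing"
new_text = "Water Line Sizing"

print(missing_tokens(old_text, new_text))  # Counter({'pipe': 1})
```

Running the same comparison against text extracted with the latest trunk would show whether a token from the report is still missing for a given file.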


Tilman

On 23.10.2015 at 21:36, Allison, Timothy B. wrote:
All,

   Apologies for the delay.  I finally finished the comparison of text 
extracted from 100k pdfs with 1.8.10 and 2.0 trunk 
(pdfbox-2.0.0-20151022.051152-1783).
The reports are available here [0].  I botched the commit message...

   I haven't had a chance to review the results.  The eval code is still in 
development and there might be bugs! To view the docs, prepend: h t t p : slash 
slash one six two . two four two . two two eight . one seven four/docs/  ... 
just don't let any of the scrapers read that. ;)  The docs include all those 
within our corpus that had a rtl word (when extracted with 1.8.10 :)) and then 
I took a random selection to fill out ~100k pdfs from common crawl and govdocs1.

   Let me know if you have any questions.

           Cheers,

                      Tim


[0] https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip




