Hi Tim,
Thanks for the report. I fixed five bugs today.
1)
re file commoncrawl2/NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU:
Please check in your code why the word "microsoft" is reported as missing.
It is in the /Title:
18 0 obj
(Microsoft Word - Water Line Pipe Sizing.docx.doc)
endobj
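A rough sketch of the kind of check I mean, in Python. The tokens() helper and its regex are assumptions made for illustration only; I don't know how the eval code actually tokenizes. It shows only that the /Title string above yields the token "microsoft" under a simple lowercasing tokenizer, not how either PDFBox version treats the title.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens.

    Hypothetical tokenizer for illustration; the real eval code may differ.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


# The /Title string from object 18, without the enclosing PDF string parentheses.
title = "Microsoft Word - Water Line Pipe Sizing.docx.doc"

title_tokens = tokens(title)
has_microsoft = "microsoft" in title_tokens

print(title_tokens)
print(has_microsoft)
```

If the eval code compares tokens case-sensitively, or splits on something other than non-alphanumeric characters, the result could differ.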
2)
Could you please rerun the test with the latest trunk, preferably with
the same test set? One of the bugs I fixed (PDFBOX-3053) applies to many
files. So now I have the problem that "problem" files I test manually no
longer miss the tokens mentioned in the report.
Tilman
On 23.10.2015 at 21:36, Allison, Timothy B. wrote:
All,
Apologies for the delay. I finally finished the comparison of text
extracted from 100k pdfs with 1.8.10 and 2.0 trunk
(pdfbox-2.0.0-20151022.051152-1783).
The reports are available here [0]. I botched the commit message...
I haven't had a chance to review the results. The eval code is still in
development and there might be bugs! To view the docs, prepend: h t t p : slash
slash one six two . two four two . two two eight . one seven four/docs/ ...
just don't let any of the scrapers read that. ;) The docs include all those
within our corpus that had an RTL word (when extracted with 1.8.10 :)) and then
I took a random selection to fill out ~100k pdfs from common crawl and govdocs1.
Let me know if you have any questions.
Cheers,
Tim
[0]
https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip