On 10.05.2017 at 17:12, Tilman Hausherr wrote:
Thanks for the test... the sum is still negative, but if we ignored the truncated files I bet we'd be positive.

I have downloaded a few of the regressions but won't create issues this time, as yesterday's turned out to be duplicates. I'll wait for Andreas's next commit and will create issues only if these aren't solved.
I guess the new exceptions aren't related. I've already created an issue for the first one, PDFBOX-3788. I haven't had a chance to look at the second file. I just tested my fix for the first one and it still fails.

@Andreas - ping me if you didn't keep the "secret" URL.
It isn't that secret as Tim posted it somewhere in this thread ...


Some misc thoughts...

039800.pdf: "refinery's" is a different token than "refinery". Shouldn't "refinery's" be three tokens? I mention this because "refinery" is probably in a dictionary.
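
To make the tokenization question concrete, here is a minimal sketch in Python. Both functions and their regexes are hypothetical illustrations, not what the evaluation tool actually does; they only show how keeping or splitting the apostrophe changes whether "refinery" appears as a token that a dictionary lookup could match.

```python
import re

def tokenize_keep_apostrophes(text):
    # Hypothetical tokenizer: an apostrophe inside a word stays part of the token.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)

def tokenize_split_apostrophes(text):
    # Hypothetical tokenizer: the apostrophe becomes a token of its own,
    # so "refinery's" yields three tokens.
    return re.findall(r"[A-Za-z]+|'", text)

print(tokenize_keep_apostrophes("the refinery's output"))
print(tokenize_split_apostrophes("the refinery's output"))
```

With the second variant, "refinery" shows up as its own token and would match a dictionary entry; with the first it would not.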

Some differences are due to a different treatment of spaces in bad fonts. Some were improved, and some now look like this: "C I T I E S W I T H O U T D R U G S". There is an open issue about these. It is tricky because if we treated such a run as one word, we'd also lose spaces where we want to keep them.
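
The trade-off can be seen in a small Python sketch. This is only an assumed post-processing heuristic for illustration (the function name and the `min_run` threshold are made up here), not how PDFBox handles spacing: it joins runs of single letters separated by single spaces.

```python
import re

def collapse_spaced_letters(text, min_run=3):
    # Hypothetical heuristic: join runs of at least `min_run` single letters
    # that are separated by exactly one space each.
    pattern = re.compile(r"\b(?:[A-Za-z] ){%d,}[A-Za-z]\b" % (min_run - 1))
    return pattern.sub(lambda m: m.group(0).replace(" ", ""), text)

print(collapse_spaced_letters("C I T I E S W I T H O U T D R U G S"))
```

The spaced-out letters are rejoined, but the real word boundaries in "CITIES WITHOUT DRUGS" are lost as well, because nothing in the text distinguishes them from the spurious spaces.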

I can't find commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z. I used http://XXX.XXX.XXX.XXX/docs/commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z

Tilman

On 10.05.2017 at 11:42, Allison, Timothy B. wrote:
Haven't had a chance to look. Reports are here:
http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


