On 10.05.2017 at 17:12, Tilman Hausherr wrote:
Thanks for the test... the sum is still negative, but if we ignored the
truncated files I bet we'd be positive.
I have downloaded a few of the regressions but won't create issues this time, as
yesterday's turned out to be duplicates. I'll wait for Andreas's next commit and
will create issues only for those that aren't solved by it.
I guess the new exceptions aren't related. I've already created an issue for the
first one, PDFBOX-3788.
I haven't had a chance to look at the second file. I just tested my fix for the
first one and it still fails.
@Andreas - ping me if you didn't keep the "secret" URL.
It isn't that secret as Tim posted it somewhere in this thread ...
Some misc thoughts...
039800.pdf: "refinery's" is a different token than refinery. Shouldn't
"refinery's" be three tokens? I mention this because refinery is probably in a
dictionary.
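
To illustrate the tokenization question above, here is a minimal sketch comparing a plain whitespace split with a split that also breaks on punctuation. The function names and the regex are made up for illustration; this is not the tokenizer that produced the reports.

```python
import re

# Hypothetical pattern: a run of letters, a run of digits,
# or any single non-space character (such as an apostrophe).
TOKEN = re.compile(r"[^\W\d_]+|\d+|\S")

def whitespace_tokens(text):
    """Split on whitespace only, so "refinery's" stays one token."""
    return text.split()

def split_tokens(text):
    """Split on punctuation too, so "refinery's" becomes three tokens."""
    return TOKEN.findall(text)

print(whitespace_tokens("the refinery's output"))
print(split_tokens("refinery's"))
```

With the second variant, "refinery" appears as its own token and could be matched against a dictionary.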
Some differences are due to a different treatment of the space in bad fonts.
Some were improved, and some now look like this "C I T I E S W I T H O U T D R U
G S". There is an open issue about these. It is tricky because if we treated
these as one word, we'd also lose spaces where we want to keep them.
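
The trade-off can be shown with a rough sketch of a naive post-processing heuristic that joins runs of single letters separated by single spaces. The function name and the `min_run` threshold are assumptions for illustration only; this is not how PDFBox handles spacing.

```python
import re

def collapse_spaced_letters(text, min_run=4):
    """Join runs of at least `min_run` single letters separated by one space."""
    # (letter + space) repeated min_run-1 or more times, then a final letter,
    # bounded on both sides so that ordinary words are not touched.
    pattern = re.compile(r"\b(?:[^\W\d_] ){%d,}[^\W\d_]\b" % (min_run - 1))
    return pattern.sub(lambda m: m.group(0).replace(" ", ""), text)

print(collapse_spaced_letters("C I T I E S W I T H O U T D R U G S"))
```

The letters are joined, but the result is a single run "CITIESWITHOUTDRUGS": the real word boundaries are lost as well, because nothing in the text distinguishes them from the spurious spaces.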
I can't find commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z. I used
http://XXX.XXX.XXX.XXX/docs/commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z
Tilman
On 10.05.2017 at 11:42, Allison, Timothy B. wrote:
Haven't had a chance to look. Reports are here:
http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]