On 09.07.2015 at 18:25, Allison, Timothy B. wrote:
9 files out of ~240k pdfs in govdocs1 had very, very minor differences. None
of the differences were actual words.
This table will likely be wrecked, but let me know if you’d like me to post it
somewhere:
Thanks, I think I get it. I can identify the files from what you posted.
Tilman
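Columns: FILE_PATH | TOKEN_COUNT_A | TOKEN_COUNT_B | UNIQUE_TOKEN_COUNT_A |
UNIQUE_TOKEN_COUNT_B | TOP_N_WORDS_A | TOP_N_WORDS_B |
TOP_10_UNIQUE_TOKEN_DIFFS_A | TOP_10_UNIQUE_TOKEN_DIFFS_B | TOP_10_MORE_IN_A |
TOP_10_MORE_IN_B | DICE_COEFFICIENT | OVERLAP

FILE_PATH: 095/095028.pdf
TOKEN_COUNT_A: 99708
TOKEN_COUNT_B: 99880
UNIQUE_TOKEN_COUNT_A: 8216
UNIQUE_TOKEN_COUNT_B: 8244
TOP_N_WORDS_A: the: 6621 | and: 4111 | of: 3361 | in: 2470 | to: 1792 |
  a: 1414 | are: 981 | is: 863 | for: 849 | area: 669
TOP_N_WORDS_B: the: 6621 | and: 4111 | of: 3361 | in: 2470 | to: 1792 |
  a: 1414 | are: 981 | is: 863 | for: 849 | area: 669
Non-empty diff cells (TOP_10_UNIQUE_TOKEN_DIFFS_A/B, TOP_10_MORE_IN_A/B), in column order:
  bc: 6 | cb: 5 | bm: 3 | ied: 2 | ec: 2 | gi: 1 | fg: 1 | fd: 1 | edbb: 1 | bd: 1
  c: 18 | d: 18 | b: 17 | f: 13 | de: 11 | h: 8 | bc: 6 | e: 6 | cb: 5 | m: 5
DICE_COEFFICIENT: 0.998299
OVERLAP: 0.999138

FILE_PATH: 167/167852.pdf
TOKEN_COUNT_A: 38313
TOKEN_COUNT_B: 39154
UNIQUE_TOKEN_COUNT_A: 6035
UNIQUE_TOKEN_COUNT_B: 6101
TOP_N_WORDS_A: wkh: 2000 | ri: 1201 | dqg: 1091 | wr: 907 | d: 776 | lq: 582 |
  lv: 531 | iru: 494 | h: 411 | 6: 378
TOP_N_WORDS_B: wkh: 2035 | ri: 1221 | dqg: 1115 | wr: 922 | d: 792 | lq: 589 |
  lv: 539 | iru: 509 | h: 417 | 6: 385
Non-empty diff cells (TOP_10_UNIQUE_TOKEN_DIFFS_A/B, TOP_10_MORE_IN_A/B), in column order:
  dpswrq: 2 | 2uelwlqj: 2 | prghudwh: 1 | odfwlf: 1 | lqiudvwuxfwxuhv: 1 |
    lplw: 1 | hqdeohv: 1 | gurvskhuh: 1 | 526: 1 | 3krwrphwu: 1
  wkh: 35 | dqg: 24 | ri: 20 | d: 16 | iru: 15 | wr: 15 | eh: 12 |
    plvvlrqv: 12 | 0lfur0dsv: 11 | odxqfk: 11
DICE_COEFFICIENT: 0.994562
OVERLAP: 0.989144

FILE_PATH: 552/552762.pdf
TOKEN_COUNT_A: 157799
TOKEN_COUNT_B: 157798
UNIQUE_TOKEN_COUNT_A: 8156
UNIQUE_TOKEN_COUNT_B: 8156
TOP_N_WORDS_A: the: 10333 | and: 4951 | to: 4614 | of: 4531 | comment: 3204 |
  in: 2935 | a: 2392 | that: 1990 | for: 1769 | no: 1759
TOP_N_WORDS_B: the: 10333 | and: 4951 | to: 4614 | of: 4531 | comment: 3204 |
  in: 2935 | a: 2392 | that: 1990 | for: 1769 | no: 1759
Non-empty diff cells (TOP_10_UNIQUE_TOKEN_DIFFS_A/B, TOP_10_MORE_IN_A/B), in column order:
  s: 1
DICE_COEFFICIENT: 1
OVERLAP: 0.999997

FILE_PATH: 575/575190.pdf
TOKEN_COUNT_A: 1127
TOKEN_COUNT_B: 1128
UNIQUE_TOKEN_COUNT_A: 260
UNIQUE_TOKEN_COUNT_B: 261
TOP_N_WORDS_A: y: 63 | r: 57 | o: 57 | a: 39 | p: 38 | e: 38 | acs: 24 |
  l: 19 | i: 19 | n: 19
TOP_N_WORDS_B: y: 63 | r: 57 | o: 57 | a: 39 | p: 38 | e: 38 | acs: 24 |
  l: 19 | i: 19 | n: 19
Non-empty diff cells (TOP_10_UNIQUE_TOKEN_DIFFS_A/B, TOP_10_MORE_IN_A/B), in column order:
  æ: 1
  æ: 1
DICE_COEFFICIENT: 0.998081
OVERLAP: 0.999557

FILE_PATH: 660/660406.pdf
TOKEN_COUNT_A: 2434
TOKEN_COUNT_B: 2437
UNIQUE_TOKEN_COUNT_A: 1084
UNIQUE_TOKEN_COUNT_B: 1085
TOP_N_WORDS_A: the: 117 | a: 86 | to: 65 | of: 59 | and: 54 | in: 53 |
  for: 38 | with: 28 | says: 18 | year: 18
TOP_N_WORDS_B: the: 117 | a: 86 | to: 65 | of: 59 | and: 54 | in: 53 |
  for: 38 | with: 28 | says: 18 | year: 18
Non-empty diff cells (TOP_10_UNIQUE_TOKEN_DIFFS_A/B, TOP_10_MORE_IN_A/B), in column order:
  zat: 1
  at: 1
  z: 3 | zat: 1
DICE_COEFFICIENT: 0.999539
OVERLAP: 0.998973

FILE_PATH: 660/660684.pdf
TOKEN_COUNT_A: 21803
TOKEN_COUNT_B: 21776
UNIQUE_TOKEN_COUNT_A: 2268
UNIQUE_TOKEN_COUNT_B: 2268
TOP_N_WORDS_A: the: 1056 | of: 764 | benefits: 651 | and: 531 | to: 492 |
  for: 452 | a: 357 | in: 350 | disabled: 246 | would: 216
TOP_N_WORDS_B: the: 1056 | of: 764 | benefits: 651 | and: 531 | to: 492 |
  for: 452 | a: 357 | in: 350 | disabled: 246 | would: 216
Non-empty diff cells (TOP_10_UNIQUE_TOKEN_DIFFS_A/B, TOP_10_MORE_IN_A/B), in column order:
  9:27
DICE_COEFFICIENT: 1
OVERLAP: 0.99938

FILE_PATH: 729/729805.pdf
TOKEN_COUNT_A: 11261
TOKEN_COUNT_B: 11266
UNIQUE_TOKEN_COUNT_A: 1866
UNIQUE_TOKEN_COUNT_B: 1866
TOP_N_WORDS_A: the: 500 | and: 456 | to: 327 | ipv6: 320 | of: 318 | in: 177 |
  for: 177 | a: 170 | internet: 127 | address: 120
TOP_N_WORDS_B: the: 500 | and: 456 | to: 327 | ipv6: 320 | of: 318 | in: 177 |
  for: 177 | a: 170 | internet: 127 | address: 120
Non-empty diff cells (TOP_10_UNIQUE_TOKEN_DIFFS_A/B, TOP_10_MORE_IN_A/B), in column order:
  z: 5
DICE_COEFFICIENT: 1
OVERLAP: 0.999778

FILE_PATH: 792/792201.pdf
TOKEN_COUNT_A: 1268
TOKEN_COUNT_B: 1265
UNIQUE_TOKEN_COUNT_A: 255
UNIQUE_TOKEN_COUNT_B: 254
TOP_N_WORDS_A: 05: 123 | 06: 78 | 04: 60 | 10: 41 | 8: 39 | 5: 36 | 7: 27 |
  12: 27 | 1: 26 | 6: 24
TOP_N_WORDS_B: 05: 123 | 06: 78 | 04: 60 | 10: 41 | 8: 39 | 5: 36 | 7: 27 |
  12: 27 | 1: 26 | 6: 24
Non-empty diff cells (TOP_10_UNIQUE_TOKEN_DIFFS_A/B, TOP_10_MORE_IN_A/B), in column order:
  r: 3
  r: 3
DICE_COEFFICIENT: 0.998035
OVERLAP: 0.998816

FILE_PATH: 999/999419.pdf
TOKEN_COUNT_A: 18917
TOKEN_COUNT_B: 18917
UNIQUE_TOKEN_COUNT_A: 1291
UNIQUE_TOKEN_COUNT_B: 1290
TOP_N_WORDS_A: 0: 5920 | 1: 1161 | 2: 957 | 5: 657 | e: 650 | 4: 547 |
  9: 436 | 3: 425 | 6: 411 | 8: 408
TOP_N_WORDS_B: 0: 5920 | 1: 1161 | 2: 957 | 5: 657 | e: 650 | 4: 547 |
  9: 436 | 3: 425 | 6: 411 | 8: 408
Non-empty diff cells (TOP_10_UNIQUE_TOKEN_DIFFS_A/B, TOP_10_MORE_IN_A/B), in column order:
  í9,150: 1 | í8,600: 1 | í13,200: 1
  9,150: 1 | 8,600: 1
  í13,200: 1 | í8,600: 1 | í9,150: 1
  13,200: 1 | 8,600: 1 | 9,150: 1
DICE_COEFFICIENT: 0.998063
OVERLAP: 0.999841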
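
The thread does not define DICE_COEFFICIENT or OVERLAP. The sketch below is one plausible reading, not the comparison tool's actual code: it assumes Dice is computed over the sets of unique tokens and overlap over the token counts (twice the shared count divided by the combined token total). The class name `TokenContrast`, the method names, and the whitespace tokenizer are all hypothetical. Under these assumptions the 552/552762.pdf row (one extra "s" among 157799 and 157798 tokens) would give 2 * 157798 / 315597, roughly 0.999997, which is consistent with the reported value but does not prove the definition.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TokenContrast {
    // Count tokens by splitting on whitespace (a stand-in for the real tokenizer).
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Assumed definition: 2 * |shared unique tokens| / (|unique A| + |unique B|).
    static double dice(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> shared = new HashSet<>(a.keySet());
        shared.retainAll(b.keySet());
        int denominator = a.size() + b.size();
        return denominator == 0 ? 1.0 : 2.0 * shared.size() / denominator;
    }

    // Assumed definition: 2 * sum(min(countA, countB)) / (tokens in A + tokens in B).
    static double overlap(Map<String, Integer> a, Map<String, Integer> b) {
        long shared = 0;
        long total = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            total += e.getValue();
            shared += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
        }
        for (int c : b.values()) {
            total += c;
        }
        return total == 0 ? 1.0 : 2.0 * shared / total;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = count("the cat sat on the mat");
        Map<String, Integer> b = count("the cat sat on the mat s");
        System.out.println("dice=" + dice(a, b) + " overlap=" + overlap(a, b));
    }
}
```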
-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, July 08, 2015 7:58 AM
To: dev@pdfbox.apache.org
Subject: RE: PDFBox 1.8.10 release
Done and launched.
-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Wednesday, July 08, 2015 3:00 AM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 1.8.10 release
On 08.07.2015 at 04:20, Allison, Timothy B. wrote:
Had to dig into code to make sure that our extension of PDFTextStripper winds
up calling the code that you are interested in. I think it does, so, yes, all
we'd have to do is two builds, one with and one without the change.
Should I make the change locally or do you plan to commit?
Locally would be best, as it is really just 1 line, and I haven't
created an issue yet.
Tilman
Thank you!
-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, July 07, 2015 3:59 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 1.8.10 release
On 07.07.2015 at 19:16, Allison, Timothy B. wrote:
Will create a separate wrapper that relies solely on PDFTextStripper instead of
what we do now. Results in a few days...
This sounds like work. Isn't all that is needed to run a version before
the change, one after the change, and display the differences as a table
like you already do?
Tilman
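
The before/after comparison described above can be sketched as a per-token count difference, similar in spirit to the TOP_10_MORE_IN_A and TOP_10_MORE_IN_B columns. This is only an illustration under assumptions: the class `CountDiff` and the method `moreIn` are hypothetical names, and the token counts are assumed to be already available as maps from a text extraction run before and after the change.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CountDiff {
    // Return up to n tokens that occur more often in x than in y,
    // with the size of the difference, largest difference first.
    static List<Map.Entry<String, Integer>> moreIn(
            Map<String, Integer> x, Map<String, Integer> y, int n) {
        List<Map.Entry<String, Integer>> diffs = new ArrayList<>();
        for (Map.Entry<String, Integer> e : x.entrySet()) {
            int diff = e.getValue() - y.getOrDefault(e.getKey(), 0);
            if (diff > 0) {
                diffs.add(new AbstractMap.SimpleEntry<String, Integer>(e.getKey(), diff));
            }
        }
        // Sort by descending difference, then alphabetically for a stable order.
        diffs.sort((p, q) -> {
            int byDiff = Integer.compare(q.getValue(), p.getValue());
            return byDiff != 0 ? byDiff : p.getKey().compareTo(q.getKey());
        });
        return diffs.size() > n ? new ArrayList<>(diffs.subList(0, n)) : diffs;
    }

    public static void main(String[] args) {
        Map<String, Integer> before = new HashMap<>();
        before.put("the", 2);
        before.put("at", 1);
        Map<String, Integer> after = new HashMap<>();
        after.put("the", 2);
        after.put("z", 3);
        after.put("zat", 1);
        System.out.println("more in before: " + moreIn(before, after, 10));
        System.out.println("more in after:  " + moreIn(after, before, 10));
    }
}
```

Calling `moreIn` once in each direction gives the two sides of the comparison; tokens with equal counts in both runs never appear in either list.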
Thank you, Tilman, for pinging me. :)
-----Original Message-----
From: Andreas Lehmkühler [mailto:andr...@lehmi.de]
Sent: Thursday, July 02, 2015 2:24 AM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 1.8.10 release
Hi,
Tilman Hausherr <thaush...@t-online.de> wrote on 1 July 2015 at 21:22:
On 30.06.2015 at 12:20, Andreas Lehmkühler wrote:
Hi,
There are again a number of solved issues, and I'm thinking about a new
bugfix release. How about a new one next week, or maybe later if someone
wants to get some additional things done first?
I have only one thing I'd like to test, with Tim Allison, before a
release: there's a line in PDFTextStripper
I'm not in a hurry ...
if ((wordSpacing == 0) || (wordSpacing == Float.NaN))
However, wordSpacing == Float.NaN is always false, because NaN compares
unequal to every value, including itself. So I'd like to find out if there
is any difference in using what the developer probably intended, which is
if ((wordSpacing == 0) || (Float.isNaN(wordSpacing)))
(BCC to Tim)
Tilman
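
A small standalone illustration of the wordSpacing check quoted above, outside PDFBox: it compares the condition as written with the `Float.isNaN` variant. The class `NaNCheck` and its method names are made up for this sketch; only the two conditions come from the thread.

```java
class NaNCheck {
    // Condition as quoted from the thread: the == comparison with NaN never matches.
    static boolean originalCheck(float wordSpacing) {
        return (wordSpacing == 0) || (wordSpacing == Float.NaN);
    }

    // Proposed condition: Float.isNaN detects NaN values.
    static boolean proposedCheck(float wordSpacing) {
        return (wordSpacing == 0) || Float.isNaN(wordSpacing);
    }

    public static void main(String[] args) {
        float[] samples = {0f, 1.5f, Float.NaN};
        for (float s : samples) {
            System.out.println(s + ": original=" + originalCheck(s)
                    + ", proposed=" + proposedCheck(s));
        }
    }
}
```

The two checks agree for every ordinary value and differ only when wordSpacing is NaN, so any difference in extracted text between the two builds would have to come from documents where that value ends up as NaN.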
BR
Andreas
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org