On 09.07.2015 at 18:25, Allison, Timothy B. wrote:
9 files out of ~240k pdfs in govdocs1 had very, very minor differences.  None 
of the differences were actual words.



This table will likely be wrecked, but let me know if you’d like me to post it 
somewhere:
Thanks, I think I get it. I can identify the files from what you posted.

Tilman


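FILE_PATH: 095/095028.pdf
TOKEN_COUNT_A: 99708
TOKEN_COUNT_B: 99880
UNIQUE_TOKEN_COUNT_A: 8216
UNIQUE_TOKEN_COUNT_B: 8244
TOP_N_WORDS_A: the: 6621 | and: 4111 | of: 3361 | in: 2470 | to: 1792 | a: 1414 | are: 981 | is: 863 | for: 849 | area: 669
TOP_N_WORDS_B: the: 6621 | and: 4111 | of: 3361 | in: 2470 | to: 1792 | a: 1414 | are: 981 | is: 863 | for: 849 | area: 669
TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
TOP_10_UNIQUE_TOKEN_DIFFS_B: bc: 6 | cb: 5 | bm: 3 | ied: 2 | ec: 2 | gi: 1 | fg: 1 | fd: 1 | edbb: 1 | bd: 1
TOP_10_MORE_IN_A: (none)
TOP_10_MORE_IN_B: c: 18 | d: 18 | b: 17 | f: 13 | de: 11 | h: 8 | bc: 6 | e: 6 | cb: 5 | m: 5
DICE_COEFFICIENT: 0.998299
OVERLAP: 0.999138

FILE_PATH: 167/167852.pdf
TOKEN_COUNT_A: 38313
TOKEN_COUNT_B: 39154
UNIQUE_TOKEN_COUNT_A: 6035
UNIQUE_TOKEN_COUNT_B: 6101
TOP_N_WORDS_A: wkh: 2000 | ri: 1201 | dqg: 1091 | wr: 907 | d: 776 | lq: 582 | lv: 531 | iru: 494 | h: 411 | 6: 378
TOP_N_WORDS_B: wkh: 2035 | ri: 1221 | dqg: 1115 | wr: 922 | d: 792 | lq: 589 | lv: 539 | iru: 509 | h: 417 | 6: 385
TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
TOP_10_UNIQUE_TOKEN_DIFFS_B: dpswrq: 2 | 2uelwlqj: 2 | prghudwh: 1 | odfwlf: 1 | lqiudvwuxfwxuhv: 1 | lplw: 1 | hqdeohv: 1 | gurvskhuh: 1 | 526: 1 | 3krwrphwu: 1
TOP_10_MORE_IN_A: (none)
TOP_10_MORE_IN_B: wkh: 35 | dqg: 24 | ri: 20 | d: 16 | iru: 15 | wr: 15 | eh: 12 | plvvlrqv: 12 | 0lfur0dsv: 11 | odxqfk: 11
DICE_COEFFICIENT: 0.994562
OVERLAP: 0.989144

FILE_PATH: 552/552762.pdf
TOKEN_COUNT_A: 157799
TOKEN_COUNT_B: 157798
UNIQUE_TOKEN_COUNT_A: 8156
UNIQUE_TOKEN_COUNT_B: 8156
TOP_N_WORDS_A: the: 10333 | and: 4951 | to: 4614 | of: 4531 | comment: 3204 | in: 2935 | a: 2392 | that: 1990 | for: 1769 | no: 1759
TOP_N_WORDS_B: the: 10333 | and: 4951 | to: 4614 | of: 4531 | comment: 3204 | in: 2935 | a: 2392 | that: 1990 | for: 1769 | no: 1759
TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
TOP_10_UNIQUE_TOKEN_DIFFS_B: (none)
TOP_10_MORE_IN_A: s: 1
TOP_10_MORE_IN_B: (none)
DICE_COEFFICIENT: 1
OVERLAP: 0.999997

FILE_PATH: 575/575190.pdf
TOKEN_COUNT_A: 1127
TOKEN_COUNT_B: 1128
UNIQUE_TOKEN_COUNT_A: 260
UNIQUE_TOKEN_COUNT_B: 261
TOP_N_WORDS_A: y: 63 | r: 57 | o: 57 | a: 39 | p: 38 | e: 38 | acs: 24 | l: 19 | i: 19 | n: 19
TOP_N_WORDS_B: y: 63 | r: 57 | o: 57 | a: 39 | p: 38 | e: 38 | acs: 24 | l: 19 | i: 19 | n: 19
TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
TOP_10_UNIQUE_TOKEN_DIFFS_B: æ: 1
TOP_10_MORE_IN_A: (none)
TOP_10_MORE_IN_B: æ: 1
DICE_COEFFICIENT: 0.998081
OVERLAP: 0.999557

FILE_PATH: 660/660406.pdf
TOKEN_COUNT_A: 2434
TOKEN_COUNT_B: 2437
UNIQUE_TOKEN_COUNT_A: 1084
UNIQUE_TOKEN_COUNT_B: 1085
TOP_N_WORDS_A: the: 117 | a: 86 | to: 65 | of: 59 | and: 54 | in: 53 | for: 38 | with: 28 | says: 18 | year: 18
TOP_N_WORDS_B: the: 117 | a: 86 | to: 65 | of: 59 | and: 54 | in: 53 | for: 38 | with: 28 | says: 18 | year: 18
TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
TOP_10_UNIQUE_TOKEN_DIFFS_B: zat: 1
TOP_10_MORE_IN_A: at: 1
TOP_10_MORE_IN_B: z: 3 | zat: 1
DICE_COEFFICIENT: 0.999539
OVERLAP: 0.998973

FILE_PATH: 660/660684.pdf
TOKEN_COUNT_A: 21803
TOKEN_COUNT_B: 21776
UNIQUE_TOKEN_COUNT_A: 2268
UNIQUE_TOKEN_COUNT_B: 2268
TOP_N_WORDS_A: the: 1056 | of: 764 | benefits: 651 | and: 531 | to: 492 | for: 452 | a: 357 | in: 350 | disabled: 246 | would: 216
TOP_N_WORDS_B: the: 1056 | of: 764 | benefits: 651 | and: 531 | to: 492 | for: 452 | a: 357 | in: 350 | disabled: 246 | would: 216
TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
TOP_10_UNIQUE_TOKEN_DIFFS_B: (none)
TOP_10_MORE_IN_A: 9: 27
TOP_10_MORE_IN_B: (none)
DICE_COEFFICIENT: 1
OVERLAP: 0.99938

FILE_PATH: 729/729805.pdf
TOKEN_COUNT_A: 11261
TOKEN_COUNT_B: 11266
UNIQUE_TOKEN_COUNT_A: 1866
UNIQUE_TOKEN_COUNT_B: 1866
TOP_N_WORDS_A: the: 500 | and: 456 | to: 327 | ipv6: 320 | of: 318 | in: 177 | for: 177 | a: 170 | internet: 127 | address: 120
TOP_N_WORDS_B: the: 500 | and: 456 | to: 327 | ipv6: 320 | of: 318 | in: 177 | for: 177 | a: 170 | internet: 127 | address: 120
TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
TOP_10_UNIQUE_TOKEN_DIFFS_B: (none)
TOP_10_MORE_IN_A: (none)
TOP_10_MORE_IN_B: z: 5
DICE_COEFFICIENT: 1
OVERLAP: 0.999778

FILE_PATH: 792/792201.pdf
TOKEN_COUNT_A: 1268
TOKEN_COUNT_B: 1265
UNIQUE_TOKEN_COUNT_A: 255
UNIQUE_TOKEN_COUNT_B: 254
TOP_N_WORDS_A: 05: 123 | 06: 78 | 04: 60 | 10: 41 | 8: 39 | 5: 36 | 7: 27 | 12: 27 | 1: 26 | 6: 24
TOP_N_WORDS_B: 05: 123 | 06: 78 | 04: 60 | 10: 41 | 8: 39 | 5: 36 | 7: 27 | 12: 27 | 1: 26 | 6: 24
TOP_10_UNIQUE_TOKEN_DIFFS_A: r: 3
TOP_10_UNIQUE_TOKEN_DIFFS_B: (none)
TOP_10_MORE_IN_A: r: 3
TOP_10_MORE_IN_B: (none)
DICE_COEFFICIENT: 0.998035
OVERLAP: 0.998816

FILE_PATH: 999/999419.pdf
TOKEN_COUNT_A: 18917
TOKEN_COUNT_B: 18917
UNIQUE_TOKEN_COUNT_A: 1291
UNIQUE_TOKEN_COUNT_B: 1290
TOP_N_WORDS_A: 0: 5920 | 1: 1161 | 2: 957 | 5: 657 | e: 650 | 4: 547 | 9: 436 | 3: 425 | 6: 411 | 8: 408
TOP_N_WORDS_B: 0: 5920 | 1: 1161 | 2: 957 | 5: 657 | e: 650 | 4: 547 | 9: 436 | 3: 425 | 6: 411 | 8: 408
TOP_10_UNIQUE_TOKEN_DIFFS_A: í9,150: 1 | í8,600: 1 | í13,200: 1
TOP_10_UNIQUE_TOKEN_DIFFS_B: 9,150: 1 | 8,600: 1
TOP_10_MORE_IN_A: í13,200: 1 | í8,600: 1 | í9,150: 1
TOP_10_MORE_IN_B: 13,200: 1 | 8,600: 1 | 9,150: 1
DICE_COEFFICIENT: 0.998063
OVERLAP: 0.999841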
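
The DICE_COEFFICIENT and OVERLAP figures above are consistent with a Dice coefficient over the sets of unique tokens and an overlap over token occurrences. For 095/095028.pdf, for example, 2 * 8216 / (8216 + 8244) is about 0.998299 and 2 * 99708 / (99708 + 99880) is about 0.999138, assuming every token in A also occurs in B. The sketch below illustrates those two measures. It is not the comparison tool used in this thread: the class name TokenSimilarity, the whitespace tokenizer, and the formulas themselves are assumptions inferred from the numbers.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TokenSimilarity {

    // Count how often each lower-cased, whitespace-separated token occurs.
    // (Assumed tokenizer; the real tool may split text differently.)
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Dice coefficient over unique tokens: 2 * |shared| / (|A| + |B|).
    static double dice(Map<String, Integer> a, Map<String, Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> shared = new HashSet<>(a.keySet());
        shared.retainAll(b.keySet());
        return 2.0 * shared.size() / (a.size() + b.size());
    }

    // Overlap over occurrences: 2 * sum(min(countA, countB)) / (totalA + totalB).
    static double overlap(Map<String, Integer> a, Map<String, Integer> b) {
        long totalA = 0;
        long totalB = 0;
        long common = 0;
        for (int c : a.values()) {
            totalA += c;
        }
        for (int c : b.values()) {
            totalB += c;
        }
        if (totalA + totalB == 0) {
            return 1.0;
        }
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            common += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
        }
        return 2.0 * common / (totalA + totalB);
    }

    public static void main(String[] args) {
        // Two extractions that differ by one stray token, as in several rows above.
        Map<String, Integer> a = countTokens("the cat sat on the mat");
        Map<String, Integer> b = countTokens("the cat sat on the z mat");
        System.out.printf("dice=%.6f overlap=%.6f%n", dice(a, b), overlap(a, b));
    }
}
```

Under these definitions a Dice value of 1 with an overlap below 1, as for 552/552762.pdf, means both extractions contain the same set of unique tokens and differ only in how often some of them occur.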






-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, July 08, 2015 7:58 AM
To: dev@pdfbox.apache.org
Subject: RE: PDFBox 1.8.10 release



Done and launched.



-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Wednesday, July 08, 2015 3:00 AM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 1.8.10 release

On 08.07.2015 at 04:20, Allison, Timothy B. wrote:

Had to dig into code to make sure that our extension of PDFTextStripper winds 
up calling the code that you are interested in.  I think it does, so, yes, all 
we'd have to do is two builds, one with and one without the change.
Should I make the change locally or do you plan to commit?

Locally would be best, as it is really just 1 line, and I haven't
created an issue yet.

Tilman



Thank you!
-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, July 07, 2015 3:59 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 1.8.10 release
On 07.07.2015 at 19:16, Allison, Timothy B. wrote:
Will create a separate wrapper that relies solely on PDFTextStripper instead of 
what we currently do.  Results in a few days...
This sounds like work. Isn't all that is needed to run a version before
the change, one after the change, and display the differences as a table
like you already do?
Tilman
Thank you, Tilman, for pinging me. :)
-----Original Message-----
From: Andreas Lehmkühler [mailto:andr...@lehmi.de]
Sent: Thursday, July 02, 2015 2:24 AM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 1.8.10 release
Hi,
Tilman Hausherr <thaush...@t-online.de> wrote on 1 July 2015 at 21:22:
On 30.06.2015 at 12:20, Andreas Lehmkühler wrote:
Hi,
there are again a number of solved issues and I'm thinking about a new
bugfix release. How about a new one next week, maybe later if someone
wants to get some additional things done before?
I have only one thing I'd like to test, with Tim Allison, before a
release: there's a line in PDFTextStripper
I'm not in a hurry ...
if ((wordSpacing == 0) || (wordSpacing == Float.NaN))
however wordSpacing == Float.NaN is always false. So I'd like to find
out if there is any difference in using what the developer probably
intended, which is
if ((wordSpacing == 0) || (Float.isNaN(wordSpacing)))
(BCC to Tim)
Tilman
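
The point about the comparison can be checked in isolation: in Java, NaN is not equal to any value, including itself, so `wordSpacing == Float.NaN` can never be true, whereas `Float.isNaN(wordSpacing)` detects it. The sketch below is a minimal standalone illustration; the class name NaNCheck and the two helper methods are hypothetical and are not PDFBox code.

```java
final class NaNCheck {

    // The condition as currently written: the NaN branch can never be true,
    // so this is equivalent to (wordSpacing == 0).
    static boolean currentCheck(float wordSpacing) {
        return (wordSpacing == 0) || (wordSpacing == Float.NaN);
    }

    // The condition the developer probably intended.
    static boolean intendedCheck(float wordSpacing) {
        return (wordSpacing == 0) || Float.isNaN(wordSpacing);
    }

    public static void main(String[] args) {
        float[] samples = {0f, 1.5f, Float.NaN};
        for (float s : samples) {
            System.out.println(s + ": current=" + currentCheck(s)
                    + ", intended=" + intendedCheck(s));
        }
    }
}
```

The two conditions differ only when wordSpacing is NaN, which is why comparing text extraction before and after the change on a large corpus shows how often that case occurs in practice.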
BR
Andreas
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org