Only 9 of the ~240k PDFs in govdocs1 showed any difference, and those differences were very, very minor. None of the differences were actual words.
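This table will likely be wrecked, but let me know if you’d like me to post it somewhere:

FILE_PATH: 095/095028.pdf
  TOKEN_COUNT_A: 99708
  TOKEN_COUNT_B: 99880
  UNIQUE_TOKEN_COUNT_A: 8216
  UNIQUE_TOKEN_COUNT_B: 8244
  TOP_N_WORDS_A: the: 6621 | and: 4111 | of: 3361 | in: 2470 | to: 1792 | a: 1414 | are: 981 | is: 863 | for: 849 | area: 669
  TOP_B_WORDS_B: the: 6621 | and: 4111 | of: 3361 | in: 2470 | to: 1792 | a: 1414 | are: 981 | is: 863 | for: 849 | area: 669
  TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
  TOP_10_UNIQUE_TOKEN_DIFFS_B: bc: 6 | cb: 5 | bm: 3 | ied: 2 | ec: 2 | gi: 1 | fg: 1 | fd: 1 | edbb: 1 | bd: 1
  TOP_10_MORE_IN_A: (none)
  TOP_10_MORE_IN_B: c: 18 | d: 18 | b: 17 | f: 13 | de: 11 | h: 8 | bc: 6 | e: 6 | cb: 5 | m: 5
  DICE_COEFFICIENT: 0.998299
  OVERLAP: 0.999138

FILE_PATH: 167/167852.pdf
  TOKEN_COUNT_A: 38313
  TOKEN_COUNT_B: 39154
  UNIQUE_TOKEN_COUNT_A: 6035
  UNIQUE_TOKEN_COUNT_B: 6101
  TOP_N_WORDS_A: wkh: 2000 | ri: 1201 | dqg: 1091 | wr: 907 | d: 776 | lq: 582 | lv: 531 | iru: 494 | h: 411 | 6: 378
  TOP_B_WORDS_B: wkh: 2035 | ri: 1221 | dqg: 1115 | wr: 922 | d: 792 | lq: 589 | lv: 539 | iru: 509 | h: 417 | 6: 385
  TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
  TOP_10_UNIQUE_TOKEN_DIFFS_B: dpswrq: 2 | 2uelwlqj: 2 | prghudwh: 1 | odfwlf: 1 | lqiudvwuxfwxuhv: 1 | lplw: 1 | hqdeohv: 1 | gurvskhuh: 1 | 526: 1 | 3krwrphwu: 1
  TOP_10_MORE_IN_A: (none)
  TOP_10_MORE_IN_B: wkh: 35 | dqg: 24 | ri: 20 | d: 16 | iru: 15 | wr: 15 | eh: 12 | plvvlrqv: 12 | 0lfur0dsv: 11 | odxqfk: 11
  DICE_COEFFICIENT: 0.994562
  OVERLAP: 0.989144

FILE_PATH: 552/552762.pdf
  TOKEN_COUNT_A: 157799
  TOKEN_COUNT_B: 157798
  UNIQUE_TOKEN_COUNT_A: 8156
  UNIQUE_TOKEN_COUNT_B: 8156
  TOP_N_WORDS_A: the: 10333 | and: 4951 | to: 4614 | of: 4531 | comment: 3204 | in: 2935 | a: 2392 | that: 1990 | for: 1769 | no: 1759
  TOP_B_WORDS_B: the: 10333 | and: 4951 | to: 4614 | of: 4531 | comment: 3204 | in: 2935 | a: 2392 | that: 1990 | for: 1769 | no: 1759
  TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
  TOP_10_UNIQUE_TOKEN_DIFFS_B: (none)
  TOP_10_MORE_IN_A: s: 1
  TOP_10_MORE_IN_B: (none)
  DICE_COEFFICIENT: 1
  OVERLAP: 0.999997

FILE_PATH: 575/575190.pdf
  TOKEN_COUNT_A: 1127
  TOKEN_COUNT_B: 1128
  UNIQUE_TOKEN_COUNT_A: 260
  UNIQUE_TOKEN_COUNT_B: 261
  TOP_N_WORDS_A: y: 63 | r: 57 | o: 57 | a: 39 | p: 38 | e: 38 | acs: 24 | l: 19 | i: 19 | n: 19
  TOP_B_WORDS_B: y: 63 | r: 57 | o: 57 | a: 39 | p: 38 | e: 38 | acs: 24 | l: 19 | i: 19 | n: 19
  TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
  TOP_10_UNIQUE_TOKEN_DIFFS_B: æ: 1
  TOP_10_MORE_IN_A: (none)
  TOP_10_MORE_IN_B: æ: 1
  DICE_COEFFICIENT: 0.998081
  OVERLAP: 0.999557

FILE_PATH: 660/660406.pdf
  TOKEN_COUNT_A: 2434
  TOKEN_COUNT_B: 2437
  UNIQUE_TOKEN_COUNT_A: 1084
  UNIQUE_TOKEN_COUNT_B: 1085
  TOP_N_WORDS_A: the: 117 | a: 86 | to: 65 | of: 59 | and: 54 | in: 53 | for: 38 | with: 28 | says: 18 | year: 18
  TOP_B_WORDS_B: the: 117 | a: 86 | to: 65 | of: 59 | and: 54 | in: 53 | for: 38 | with: 28 | says: 18 | year: 18
  TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
  TOP_10_UNIQUE_TOKEN_DIFFS_B: zat: 1
  TOP_10_MORE_IN_A: at: 1
  TOP_10_MORE_IN_B: z: 3 | zat: 1
  DICE_COEFFICIENT: 0.999539
  OVERLAP: 0.998973

FILE_PATH: 660/660684.pdf
  TOKEN_COUNT_A: 21803
  TOKEN_COUNT_B: 21776
  UNIQUE_TOKEN_COUNT_A: 2268
  UNIQUE_TOKEN_COUNT_B: 2268
  TOP_N_WORDS_A: the: 1056 | of: 764 | benefits: 651 | and: 531 | to: 492 | for: 452 | a: 357 | in: 350 | disabled: 246 | would: 216
  TOP_B_WORDS_B: the: 1056 | of: 764 | benefits: 651 | and: 531 | to: 492 | for: 452 | a: 357 | in: 350 | disabled: 246 | would: 216
  TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
  TOP_10_UNIQUE_TOKEN_DIFFS_B: (none)
  TOP_10_MORE_IN_A: 9: 27
  TOP_10_MORE_IN_B: (none)
  DICE_COEFFICIENT: 1
  OVERLAP: 0.99938

FILE_PATH: 729/729805.pdf
  TOKEN_COUNT_A: 11261
  TOKEN_COUNT_B: 11266
  UNIQUE_TOKEN_COUNT_A: 1866
  UNIQUE_TOKEN_COUNT_B: 1866
  TOP_N_WORDS_A: the: 500 | and: 456 | to: 327 | ipv6: 320 | of: 318 | in: 177 | for: 177 | a: 170 | internet: 127 | address: 120
  TOP_B_WORDS_B: the: 500 | and: 456 | to: 327 | ipv6: 320 | of: 318 | in: 177 | for: 177 | a: 170 | internet: 127 | address: 120
  TOP_10_UNIQUE_TOKEN_DIFFS_A: (none)
  TOP_10_UNIQUE_TOKEN_DIFFS_B: (none)
  TOP_10_MORE_IN_A: (none)
  TOP_10_MORE_IN_B: z: 5
  DICE_COEFFICIENT: 1
  OVERLAP: 0.999778

FILE_PATH: 792/792201.pdf
  TOKEN_COUNT_A: 1268
  TOKEN_COUNT_B: 1265
  UNIQUE_TOKEN_COUNT_A: 255
  UNIQUE_TOKEN_COUNT_B: 254
  TOP_N_WORDS_A: 05: 123 | 06: 78 | 04: 60 | 10: 41 | 8: 39 | 5: 36 | 7: 27 | 12: 27 | 1: 26 | 6: 24
  TOP_B_WORDS_B: 05: 123 | 06: 78 | 04: 60 | 10: 41 | 8: 39 | 5: 36 | 7: 27 | 12: 27 | 1: 26 | 6: 24
  TOP_10_UNIQUE_TOKEN_DIFFS_A: r: 3
  TOP_10_UNIQUE_TOKEN_DIFFS_B: (none)
  TOP_10_MORE_IN_A: r: 3
  TOP_10_MORE_IN_B: (none)
  DICE_COEFFICIENT: 0.998035
  OVERLAP: 0.998816

FILE_PATH: 999/999419.pdf
  TOKEN_COUNT_A: 18917
  TOKEN_COUNT_B: 18917
  UNIQUE_TOKEN_COUNT_A: 1291
  UNIQUE_TOKEN_COUNT_B: 1290
  TOP_N_WORDS_A: 0: 5920 | 1: 1161 | 2: 957 | 5: 657 | e: 650 | 4: 547 | 9: 436 | 3: 425 | 6: 411 | 8: 408
  TOP_B_WORDS_B: 0: 5920 | 1: 1161 | 2: 957 | 5: 657 | e: 650 | 4: 547 | 9: 436 | 3: 425 | 6: 411 | 8: 408
  TOP_10_UNIQUE_TOKEN_DIFFS_A: í9,150: 1 | í8,600: 1 | í13,200: 1
  TOP_10_UNIQUE_TOKEN_DIFFS_B: 9,150: 1 | 8,600: 1
  TOP_10_MORE_IN_A: í13,200: 1 | í8,600: 1 | í9,150: 1
  TOP_10_MORE_IN_B: 13,200: 1 | 8,600: 1 | 9,150: 1
  DICE_COEFFICIENT: 0.998063
  OVERLAP: 0.999841

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, July 08, 2015 7:58 AM
To: dev@pdfbox.apache.org
Subject: RE: PDFBox 1.8.10 release

Done and launched.

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Wednesday, July 08, 2015 3:00 AM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 1.8.10 release

On 08.07.2015 at 04:20, Allison, Timothy B. wrote:
> Had to dig into code to make sure that our extension of PDFTextStripper winds
> up calling the code that you are interested in. I think it does, so, yes,
> all we'd have to do is two builds, one with and one without the change.
>
> Should I make the change locally or do you plan to commit?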
Locally would be best, as it is really just 1 line, and I haven't created an issue yet.

Tilman

>
> Thank you!
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, July 07, 2015 3:59 PM
> To: dev@pdfbox.apache.org
> Subject: Re: PDFBox 1.8.10 release
>
> On 07.07.2015 at 19:16, Allison, Timothy B. wrote:
>> Will create separate wrapper that relies solely on PDFTextStripper instead
>> of what we currently do now. Results in a few days...
> This sounds like work. Isn't all that is needed to run a version before
> the change, one after the change, and display the differences as a table
> like you already do?
>
> Tilman
>
>> Thank you, Tilman, for pinging me. :)
>>
>> -----Original Message-----
>> From: Andreas Lehmkühler [mailto:andr...@lehmi.de]
>> Sent: Thursday, July 02, 2015 2:24 AM
>> To: dev@pdfbox.apache.org
>> Subject: Re: PDFBox 1.8.10 release
>>
>> Hi,
>>
>>> Tilman Hausherr <thaush...@t-online.de> wrote on 1 July 2015 at 21:22:
>>>
>>> On 30.06.2015 at 12:20, Andreas Lehmkühler wrote:
>>>> Hi,
>>>>
>>>> there are again a number of solved issues and I'm thinking about a new
>>>> bugfix release. How about a new one next week, maybe later if someone
>>>> wants to get some additional things done before?
>>> I have only one thing I'd like to test, with Tim Allison, before a
>>> release: there's a line in PDFTextStripper
>> I'm not in a hurry ...
>>
>>> if ((wordSpacing == 0) || (wordSpacing == Float.NaN))
>>>
>>> however wordSpacing == Float.NaN is always false. So I'd like to find
>>> out if there is any difference in using what the developer probably
>>> intended, which is
>>>
>>> if ((wordSpacing == 0) || (Float.isNaN(wordSpacing)))
>>>
>>> (BCC to Tim)
>>>
>>> Tilman
>> BR
>> Andreas
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
>> For additional commands, e-mail: dev-h...@pdfbox.apache.org
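
For anyone following along, here is a minimal standalone sketch of the two conditions quoted in the thread. It is not the PDFBox code itself: the class name NaNCheckDemo and the methods originalCheck and proposedCheck are made up for illustration, and only the two boolean expressions are taken from the messages above. It shows why the `== Float.NaN` comparison can never match: in Java, NaN compares unequal to every value, including itself.

```java
public class NaNCheckDemo {

    // Hypothetical helper mirroring the condition currently in the code.
    // The second comparison is always false, because NaN == NaN is false.
    static boolean originalCheck(float wordSpacing) {
        return (wordSpacing == 0) || (wordSpacing == Float.NaN);
    }

    // Hypothetical helper mirroring the condition the developer probably intended.
    // Float.isNaN is the standard way to test a float for NaN.
    static boolean proposedCheck(float wordSpacing) {
        return (wordSpacing == 0) || (Float.isNaN(wordSpacing));
    }

    public static void main(String[] args) {
        float[] samples = {0f, 1.5f, Float.NaN};
        for (float w : samples) {
            System.out.println(w
                    + " original=" + originalCheck(w)
                    + " proposed=" + proposedCheck(w));
        }
    }
}
```

The two helpers agree for 0 and for ordinary values and differ only when wordSpacing is NaN, where originalCheck returns false and proposedCheck returns true. So any difference between the two builds can only come from documents where the word spacing actually evaluates to NaN.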