On 2014-05-22, Chris Angelico wrote:
> On Thu, May 22, 2014 at 11:41 PM, Adam Funk <a24...@ducksburg.com> wrote:
>> On further reflection, I think I asked for that. In fact, the table
>> I'm using only has one column for the hashes --- I wasn't going to
>> store the strings at all in order to save disk space (maybe my mind is
>> stuck in the 1980s).
> That's a problem, then, because you will see hash collisions. Maybe
> not often, but they definitely will occur if you have enough strings
> (look up the birthday paradox - with a 32-bit arbitrarily selected
> integer (such as a good crypto hash that you then truncate to 32
> bits), you have a 50% chance of a collision at just 77,000 strings).
Ah yes, there's a handy table for that (the birthday-problem
probability table).
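The 50% figure also drops straight out of the usual birthday-bound
approximation n ~ sqrt(2 * d * ln 2); a quick sanity check in Python,
assuming a 32-bit hash space:

    import math

    # birthday bound: number of draws from a space of size d needed
    # for a ~50% chance of at least one collision
    d = 2 ** 32
    n = math.sqrt(2 * d * math.log(2))
    print(round(n))   # -> 77163, i.e. roughly 77,000 strings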
> Do you have enough RAM to hold all the strings directly? Just load 'em
> all up into a Python set. Set operations are fast, clean, and easy.
> Your already_seen function becomes a simple 'in' check. These days you
> can get 16GB or 32GB of RAM in a PC inexpensively enough; with an
> average string size of 80 characters, and assuming Python 3.3+, that's
> about 128 bytes each - close enough, and a nice figure. 16GB divided
> by 128 gives 128M strings - obviously you won't get all of that, but
> that's your ball-park. Anything less than, say, a hundred million
> strings, and you can dump the lot into memory. Easy!
Good point, & since (as I explained in my other post) the substrings
are being deduplicated in their own table anyway, it's probably not
worth bothering with persistence between runs for this bit.
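For anyone following along, the in-memory version really is about as
simple as it gets; a minimal sketch of the set-based already_seen
(names are mine, not from the actual code):

    seen = set()

    def already_seen(s):
        """Return True if s was seen before; otherwise record it."""
        if s in seen:
            return True
        seen.add(s)
        return False

    already_seen("spam")   # False the first time
    already_seen("spam")   # True thereafter

No hashing, no collisions, and membership tests are O(1) on average.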
Some say the world will end in fire; some say in segfaults.