Hi all,
I am a Python novice, and right now I would be happy simply to get my job
done with it, but I would appreciate some thoughts on the issue below.

I need to assign one of four numbers to each name in a list. The assignment
should be pseudo-random: no pattern whatsoever, but deterministic,
reproducible, and close to uniform. My understanding was that hash functions
would do the job. As I only need 2 bits for the treatment, I picked one byte
of each generated hash and took it mod 4. See the script below.
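
Read literally, the idea described above (one byte of the hash, mod 4) could
be sketched as follows. This is only an illustration under my own
assumptions, not the script at the end of this message: the function name
assign_treatment is made up, names are assumed to be encoded as UTF-8, and
the byte is taken from the raw digest rather than from its hex string.

```python
import hashlib

def assign_treatment(name):
    """Map a name to a treatment number 0..3, deterministically."""
    # Raw SHA-512 digest: 64 bytes. UTF-8 encoding is an assumption.
    digest = hashlib.sha512(name.encode('utf-8')).digest()
    # bytearray indexing yields an integer in 0..255.
    first_byte = bytearray(digest)[0]
    # Keep the two lowest bits of that byte.
    return first_byte % 4
```

Calling assign_treatment twice on the same name gives the same number, which
is the reproducibility I am after.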

After writing a short Python script that hashes my text file line by line
and writes the number next to each original name, I checked what I got.
Instead of around 25% in each treatment, the shares range from 17.8% to
31.3%. I understand that pseudo-randomness means the treatments will not
be neat and symmetric. Still, this variation is unacceptable for my purpose.
My understanding was that good hash functions generate numbers that look
completely random, and that picking only one byte should not change that. I
thought the promise was also to get close to uniformity:
http://en.wikipedia.org/wiki/Hash_function#Uniformity. I tried all the
hashes in the hashlib module, and picked bytes from the beginning and the
end of the hashes, but the treatments were never close to uniform (curiously,
the last treatment always seems to be too rare).
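
The check itself can be reproduced along these lines. This is a minimal
sketch, not necessarily the exact check behind the percentages above; the
function name treatment_shares is hypothetical, and it assumes each line has
the form "name,treatment".

```python
from collections import Counter

def treatment_shares(lines):
    """Return {treatment: share of rows} for lines like 'name,treatment'."""
    counts = Counter()
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        # Split on the last comma so names containing commas still parse.
        name, treatment = line.rsplit(',', 1)
        counts[int(treatment)] += 1
    total = sum(counts.values())
    # float() keeps the division exact under Python 2 as well.
    return {t: counts[t] / float(total) for t in sorted(counts)}
```

Passing the opened nameshashed.txt file to treatment_shares should give the
share of each treatment number that appears in it.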

Maybe it is an obvious CS puzzle; I look forward to standing corrected.

Thanks!

Laszlo

The script:

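#!/usr/bin/python

import hashlib

# Read one name per line; append "name,treatment" lines to the output file.
with open('names.txt', 'r') as f, open('nameshashed.txt', 'a') as g:
    for line in f:
        line = line.rstrip()
        # Hex digest of the name (128 hex characters for SHA-512).
        h = hashlib.sha512(line.encode('utf-8')).hexdigest()
        # Character code of the hex character at index 64, taken mod 4.
        s = line + ',' + str(ord(h[64]) % 4) + '\n'
        g.write(s)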