On 2 Feb 2005, [EMAIL PROTECTED] wrote:

On 1 Feb 2005 13:57:18 -0500, Ted Zlatanov <[EMAIL PROTECTED]> wrote:

>> 2 levels:
>> k/ka/karolis
>> t/tz/tzz
>> 
>> and so on...  The advantage is that you'll get a fairly even
>> distribution, and you know exactly where to find any user's Maildir
>> (with numeric subdirectories, you have to hunt for it or keep a
>> database).
> 
> I currently use this method for some data I store, and it works nicely,
> but I don't think I would consider it very balanced. I am using sample
> data from our customer base and, as you would expect, hashing this way
> by domain and/or email address gets very uneven when dealing with
> names people select.

Yes, absolutely.  It's not as good as a hash function of any sort.
The advantage, obviously, is to QUICKLY locate any user ID manually.
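
For reference, here is a minimal Python sketch of the two-level prefix layout quoted above (k/ka/karolis). The function name prefix_maildir and the /var/mail root are made up for illustration; nothing here is taken from an existing tool.

```python
import posixpath


def prefix_maildir(root: str, user: str) -> str:
    """Build root/<1st char>/<first 2 chars>/<user>, e.g. root/k/ka/karolis."""
    # Slicing is safe for very short names: "a"[:2] is just "a".
    return posixpath.join(root, user[:1], user[:2], user)


print(prefix_maildir("/var/mail", "karolis"))
print(prefix_maildir("/var/mail", "tzz"))
```

The path can be worked out by eye from the user name alone, which is the whole appeal; the cost is that popular leading letters get crowded.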

> Mainly common letters (like vowels) fill much faster than, say, w, x, z.
> I am probably going to implement a tracked integer ID per email
> address and fill the hash in reverse order. Since the last digit of
> the ID increments, it will naturally spread across the hash. The only
> factor that will impact the balance is deletion at that point.

> This has a lot of drawbacks because of the need to track the
> maildirId, and manage the increment of it. Does anyone else have any
> other methods they use, and would share?
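
One possible reading of the integer-ID idea quoted above, as a rough Python sketch: use the last digits of the ID, in reverse order, as the directory levels. The name id_maildir, the two-level depth, and the zero padding are my assumptions, not something stated in the quoted message.

```python
import posixpath


def id_maildir(root: str, maildir_id: int, levels: int = 2) -> str:
    """Place an ID under directories named after its last digits, reversed."""
    # Pad short IDs so there are always enough digits (assumption).
    digits = str(maildir_id).zfill(levels)
    # The last digit comes first because it changes with every new ID.
    parts = [digits[-(i + 1)] for i in range(levels)]
    return posixpath.join(root, *parts, str(maildir_id))


print(id_maildir("/var/mail", 1234))
print(id_maildir("/var/mail", 1235))
```

Consecutive IDs land in different top-level directories, but you still need a lookup from address to ID, which is the drawback mentioned in the quote.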

Why not just do MD5 hashing of the name and be done with it?  I would
expect any home-grown hashing scheme to be less capable.  You also
don't need to track the currently allocated ID.  Just make sure you
can handle more than one user per hash prefix (and, in principle, per
full hash).  Your distribution will be very close to ideal because MD5
output is spread almost uniformly, whatever names people pick.  Then
just break things up by the first N characters of the hash.

The big advantage is that your tracking utility is reduced to this Perl code:

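# (using the Digest::MD5 module, which hashes the string itself;
# a bare `md5 $user_id` would hash a file of that name)
use Digest::MD5 qw(md5_hex);
$tracking_id = md5_hex($user_id); # let's say 467834ab4...

# first character of the hash; substr offsets start at 0
$dirname = $root . substr($tracking_id, 0, 1);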
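
To get a feel for how even the spread is, here is a small Python sketch of the same idea, with a quick bucket count. The function name md5_maildir, the choice to append the user ID under the bucket, and the generated user names are assumptions for illustration only; they are not customer data.

```python
import hashlib
import posixpath
from collections import Counter


def md5_maildir(root: str, user_id: str, n: int = 1) -> str:
    """Bucket a user by the first n hex characters of the MD5 of the ID."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    # Appending user_id under the bucket is an assumption of this sketch.
    return posixpath.join(root, digest[:n], user_id)


# Count made-up users per one-character bucket (16 possible buckets).
users = [f"user{i}" for i in range(16000)]
counts = Counter(hashlib.md5(u.encode("utf-8")).hexdigest()[0] for u in users)
print(len(counts), min(counts.values()), max(counts.values()))
```

Raising n gives more, smaller buckets: each extra hex character multiplies the directory count by 16.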

Ted
