[EMAIL PROTECTED] wrote:

'Tis a puzzlement... Okay, I've got this code:

my %h;

print scalar(%h), "\n";
$h{a} = 1;
print scalar(%h), "\n";
keys %h = 256;
print scalar(%h), "\n";
$h{$_}=$_ for 'a'..'zzz';
print scalar(%h), "\n";

and it prints out:

0
1/8
1/256
14056/32768

So what do those numbers mean? I thought I had it figured out until the last one. That would've made sense, too, if it'd been 18278, the number of strings between 'a' and 'zzz'. That would explain the 32768--the next 2**x after 18278. But where'd the 14056 come from?

Come to think of it, shouldn't the second line of output been 1/2? Or is eight the minimum number of buckets you get for a hash, whether you use 'em or not?

As I remarked in my other message, it skips directly from zero to eight.

As to the apparent anomaly, the short answer (which requires that you thoroughly understand why it's called a "hash") is that the number is the number of buckets in use, not the number of items in the hash.

If you do not understand the above, then, in order to understand it, you're going to have to know how hashing works, and it's not simple.

Hashing was first discovered in the 1950s, and continues to be the best way to look up random data, provided that you do not also have a need to read it in sequential order without first sorting it. It is the fastest known way to access random data in RAM, and it is the fastest known way to access random data on disk if you can't afford to keep an index of the disk in RAM, and there are very strong theoretical reasons to believe that no faster method is possible.

Here's an analogy of how it's done (paraphrased from memory from "The History of Early IBM Computers", M.I.T. press).

You're inviting a good many people to a dinner party, and letting each guest order his own meal in advance. Because there are many, many guests (say a thousand), you've hired waiters for the occasion.

But this raises the question of how the waiters are to find the right guests. One way you could do it would be to make a paper listing, but let's say that you don't have a computer (or a typewriter) and you don't want to write it out by hand, so the waiters will have to be able to do without a written list. Fortunately, all your waiters just happen to be lightning calculators, so you come up with this idea.

You give your maitre d'hotel (who is also a lightning calculator) the following instructions. Given the guest's first and last name, he is to imagine dialing the letters on a telephone keypad, and add together first-digit*2, second-digit*3, third-digit*5, fourth-digit*7, fifth-digit*11, and so on through the prime numbers, and then multiply the sum by the next prime number. Then he divides the product by 1000, and takes the remainder, which will be between 0 and 999. He leads the guest to the seat with that number.

Now, sometimes, there will already be a guest in that seat. (This is called a "collision".) If that happens, he just goes down the line until he finds the next empty seat (wrapping around from 999 to 0, of course), and seats the guest there. Then he goes back to the original seat, and leaves a paper in a convenient pocket in the back telling where the new guest is.

If there is already a card in the pocket, then he still goes to the next empty seat, but leaves a second card (pointing to the third seat) in the empty pocket of the second seat, and so on.

When a waiter wishes to find a guest, he performs the same calculation that the maitre d'hotel did, and proceeds to the first seat. If he has the wrong person, he looks in the card in the back, and proceeds to the second seat, and so on.

The exact mathematical rule, up to, but not including, the division and remainder, varies; the example I give above is workable, but not great. Producing a really good hash algorithm (so called because it "makes a hash of" the guest's names) is tricky, and, nowadays, it's usually best to rely on a general-purpose one that professionals have designed than to try to invent one for yourself.

Believe it or not, for any reasonably large number of items, this is the fastest way to do it, and, as the number of items gets bigger, the advantage over other schemes gets bigger and bigger.

Now, to get back to the original question, the number of "buckets" used is the number of /first/ seats used. Whenever there is a collision, the number of guests increases by one, but the number of "buckets" used stays the same.

It is not surprising that the sequence 'a'..'zzz' produced a great many collisions. Real data will usually do much better.

--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
  -- Charles Williams.  "Judgement at Chelmsford"

_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

Reply via email to