[EMAIL PROTECTED] wrote:
'Tis a puzzlement... Okay, I've got this code:
my %h;
print scalar(%h), "\n";
$h{a} = 1;
print scalar(%h), "\n";
keys %h = 256;
print scalar(%h), "\n";
$h{$_}=$_ for 'a'..'zzz';
print scalar(%h), "\n";
and it prints out:
0
1/8
1/256
14056/32768
So what do those numbers mean? I thought I had it figured out until the
last one. That would've made sense, too, if it'd been 18278, the number
of strings between 'a' and 'zzz'. That would explain the 32768--the next
2**x after 18278. But where'd the 14056 come from?
Come to think of it, shouldn't the second line of output been 1/2? Or is
eight the minimum number of buckets you get for a hash, whether you use
'em or not?
As I remarked in my other message, it skips directly from zero to eight.
As to the apparent anomaly, the short answer (which requires that you
thoroughly understand why it's called a "hash") is that the number is
the number of buckets in use, not the number of items in the hash.
If you do not understand the above, then, in order to understand it,
you're going to have to know how hashing works, and it's not simple.
Hashing was first discovered in the 1950s, and continues to be the best
way to look up random data, provided that you do not also have a need to
read it in sequential order without first sorting it. It is the fastest
known way to access random data in RAM, and it is the fastest known way
to access random data on disk if you can't afford to keep an index of
the disk in RAM, and there are very strong theoretical reasons to
believe that no faster method is possible.
Here's an analogy of how it's done (paraphrased from memory from "The
History of Early IBM Computers", M.I.T. press).
You're inviting a good many people to a dinner party, and letting each
guest order his own meal in advance. Because there are many, many guests
(say a thousand), you've hired waiters for the occasion.
But this raises the question of how the waiters are to find the right
guests. One way you could do it would be to make a paper listing, but
let's say that you don't have a computer (or a typewriter) and you don't
want to write it out by hand, so the waiters will have to be able to do
without a written list. Fortunately, all your waiters just happen to be
lightning calculators, so you come up with this idea.
You give your maitre d'hotel (who is also a lightning calculator) the
following instructions. Given the guest's first and last name, he is to
imagine dialing the letters on a telephone keypad, and add together
first-digit*2, second-digit*3, third-digit*5, fourth-digit*7,
fifth-digit*11, and so on through the prime numbers, and then multiply
the sum by the next prime number. Then he divides the product by 1000,
and takes the remainder, which will be between 0 and 999. He leads the
guest to the seat with that number.
Now, sometimes, there will already be a guest in that seat. (This is
called a "collision".) If that happens, he just goes down the line until
he finds the next empty seat (wrapping around from 999 to 0, of course),
and seats the guest there. Then he goes back to the original seat, and
leaves a paper in a convenient pocket in the back telling where the new
guest is.
If there is already a card in the pocket, then he still goes to the next
empty seat, but leaves a second card (pointing to the third seat) in the
empty pocket of the second seat, and so on.
When a waiter wishes to find a guest, he performs the same calculation
that the maitre d'hotel did, and proceeds to the first seat. If he has
the wrong person, he looks in the card in the back, and proceeds to the
second seat, and so on.
The exact mathematical rule, up to, but not including, the division and
remainder, varies; the example I give above is workable, but not great.
Producing a really good hash algorithm (so called because it "makes a
hash of" the guest's names) is tricky, and, nowadays, it's usually best
to rely on a general-purpose one that professionals have designed than
to try to invent one for yourself.
Believe it or not, for any reasonably large number of items, this is the
fastest way to do it, and, as the number of items gets bigger, the
advantage over other schemes gets bigger and bigger.
Now, to get back to the original question, the number of "buckets" used
is the number of /first/ seats used. Whenever there is a collision, the
number of guests increases by one, but the number of "buckets" used
stays the same.
It is not surprising that the sequence 'a'..'zzz' produced a great many
collisions. Real data will usually do much better.
--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
-- Charles Williams. "Judgement at Chelmsford"
_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs