[EMAIL PROTECTED] wrote:
Wizards,
Those who've answered my previous questions about hashes have been most
helpful. I especially appreciated John W. Kennedy's excellent card
catalogue cabinet analogy.
I have one last (I hope) question on the topic: I understand from
previous answers that not all data sets will hash "politely." Some will
have a very nice one-item-to-one-bucket outcome, while others can have
several, even many, items in just a few buckets. And that the worst case
could be "10,000 items all in one bucket." Is there a way of
determining how well my data goes into its hash? As an example of my
question:
    $h{$_} = $_ for 'a'..'zzz';
    print scalar( %h ), "\n";
prints out
    14056/32768
This says that the 18,278 values generated used only 14,056 of the
reserved 32,768 buckets. I understand that permutations would hash to
the same key (er, the same bucket?), for example "cab" and "abc"
(although I suspect this would depend on the hashing algorithm, wouldn't
it?). My question here is simply if there's a way to see how
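(A side note for readers on newer perls: since 5.26, scalar %h returns the
number of keys instead of the used/total string, and the old view moved to
Hash::Util. A sketch assuming that module's bucket functions are available;
because modern perls randomize their hashing, the exact ratio will differ
from run to run:)

    use strict;
    use warnings;
    use Hash::Util qw(bucket_ratio num_buckets used_buckets);

    my %h;
    $h{$_} = $_ for 'a'..'zzz';

    # bucket_ratio gives the same "used/total" string that
    # scalar(%h) printed on older perls.
    print bucket_ratio(%h), "\n";
    printf "%d of %d buckets used\n", used_buckets(%h), num_buckets(%h);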
"well-behaved" my data set is. Some of my scalars come out to:
59/128 -- for 69 values
78/128 -- for 120 values
In fact, none of them come out "okay" -- that is, x buckets for x values.
I guess a second question might be whether the answer to the first is
even meaningful--at least insofar as this has any effect on performance.
If it doesn't, I suppose the question's almost pointless, except as a
point of knowledge.
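(To see not just how many buckets are used but which keys collide, one can
group the keys by simulated bucket. A sketch, assuming the classic
multiply-by-33 hash described in the reply below; modern perls use
different, randomized hash functions, so real collision patterns will
differ:)

    use strict;
    use warnings;
    use integer;

    # Classic multiply-by-33 Perl hash of that era.
    sub hash33 {
        my ($key) = @_;
        my $hash = 0;
        $hash = $hash * 33 + ord for split //, $key;
        return $hash + ($hash >> 5);
    }

    my $nbuckets = 32768;    # must be a power of two for the mask below
    my %by_bucket;
    push @{ $by_bucket{ hash33($_) & ($nbuckets - 1) } }, $_ for 'a' .. 'zzz';

    my @crowded = grep { @{ $by_bucket{$_} } > 1 } keys %by_bucket;
    printf "%d buckets used, %d of them hold more than one key\n",
        scalar keys %by_bucket, scalar @crowded;

Note that under this hash, permutations such as "abc" and "cab" do not land
in the same bucket in general, since each character is scaled by a different
power of 33.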
In order to really know how effective a hash algorithm is, it is
necessary to have statistics on chain lengths. I don't believe Perl
gives you that, but here is a program that will simulate it for you.
use strict;
use warnings;
use integer;

# Initialize the buckets to n zeros.
sub init_buckets ($$) {
    my ($buckets, $size) = @_;
    @$buckets = (0) x $size;
}

# Calculate the raw hash value using the documented Perl algorithm.
sub rawhash ($) {
    my ($key) = @_;
    my $hash = 0;
    for (my $i = 0; $i < length $key; ++$i) {
        $hash = $hash * 33 + ord(substr($key, $i, 1));
    }
    return $hash + ($hash >> 5);
}

# Simulate adding (or re-adding) the data.
sub add_key ($$) {
    my ($buckets, $key) = @_;
    my $hash = rawhash($key);
    # $#$buckets is $size - 1, an all-ones mask while $size is a power of 2.
    ++$$buckets[$hash & $#$buckets];
}

# All hashes start with a minimum of eight buckets. Initialize them.
my $size = 8;
my @buckets = ();
init_buckets \@buckets, $size;

# Since we are not creating a real hash, create an array for use in
# re-hashing.
my @save = ();

# Prepare a count of items added.
my $items = 0;

# This is where the work is done, reading input.
while (<>) {
    chomp;    # strip the newline so it does not feed into the hash
    # If we have more data than buckets, double the number of
    # buckets and re-hash all the existing data.
    if (++$items > $size) {
        $size *= 2;
        init_buckets \@buckets, $size;
        foreach (@save) {
            add_key \@buckets, $_;
        }
    }
    # Save the input for re-hashing.
    push @save, $_;
    # Simulate adding the new item.
    add_key \@buckets, $_;
}

# Show the result.
print join(', ', @buckets), "\n";
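(If the comma-separated dump is too raw, a chain-length histogram condenses
it into the statistics the program is after. A small standalone helper,
hypothetical and not part of the program above, that takes the bucket counts
and reports how many buckets have each chain length:)

    use strict;
    use warnings;

    sub chain_stats {
        my @buckets = @_;
        my %hist;    # chain length => number of buckets with that length
        ++$hist{$_} for @buckets;
        return map { [ $_, $hist{$_} ] } sort { $a <=> $b } keys %hist;
    }

    # Example: 8 buckets holding 0, 0, 1, 2, 0, 1, 1, 3 items.
    for my $row (chain_stats(0, 0, 1, 2, 0, 1, 1, 3)) {
        printf "chain length %d: %d bucket(s)\n", @$row;
    }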
--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
-- Charles Williams. "Judgement at Chelmsford"
_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs