On Wed, 29 Dec 2004 10:49:19 -0500, Aaron Sherman <[EMAIL PROTECTED]> wrote:
> On Tue, 2004-12-28 at 13:46, Ian Langworth wrote:
> > On 28.Dec.2004 01:14AM -0500, Tom Metro wrote:
> >
> > > If you are concerned about the performance impact of long
> > > keys, and your application fits a "write-once, read-many"
> > > model, then you could always hash the hash keys. Say generate
> > > an MD5 digest of the key string, and then use the digest as
> > > the hash key.
> >
> > This might make a nice Tie:: module, if there already isn't
> > one. But then again, tie itself is allegedly slow...
> 
> No, that would defeat the point....
> 
> Or at least that's what I was going to say... I had a whole rationale
> typed up, but then I went to benchmark my hypothesis and I get this:
> 
>         $ perl -MBenchmark -e 'my $long="a"x10_000;my %x;timethis(100_000,sub {$x{$long}++});print "Final: $x{$long}\n"'
>         timethis 100000:  7 wallclock secs ( 6.54 usr +  0.00 sys =  6.54 CPU) @ 15290.52/s (n=100000)
>         Final: 100000
>         $ perl -MBenchmark -e 'my $long="a"x10_000;my %x;timethis(100_000,sub {my $tmp=unpack("%32C*",$long) % 65535;$x{$tmp}++});my $tmp=unpack("%32C*",$long) % 65535;print "Final: $x{$tmp}\n"'
>         timethis 100000:  2 wallclock secs ( 2.16 usr +  0.00 sys =  2.16 CPU) @ 46296.30/s (n=100000)
>         Final: 100000
> 
> Is there a bug in my code, or is there really that substantial a
> savings?

The savings is somewhere between nothing and the ratio of the
two key lengths, so it is in the range that I expected.  I see no
obvious errors in your code, so I'd say you're simply seeing
what the savings looks like.
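
One caveat about the checksum itself, though: unpack "%32C*" just
sums byte values, so it is completely order-insensitive -- any two
keys that are permutations of each other produce the same short key.
A small illustration (my example, not from your benchmark):

```perl
use strict;
use warnings;

# The cheap key from the benchmark: sum of byte values, folded to 16 bits.
sub cheap_key { unpack("%32C*", $_[0]) % 65535 }

my $k1 = "ab" x 5_000;   # 10_000-byte key
my $k2 = "ba" x 5_000;   # same bytes, different order
print cheap_key($k1) == cheap_key($k2) ? "collide\n" : "distinct\n";
# Both keys have the same byte total, so this prints "collide";
# a wrapper module would have to keep the original strings around
# to tell such keys apart.
```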

> Of course, there's a substantial problem with the above: hashes DO
> conflict. Your module would have to do the same conflict resolution that
> perl's built-in hashing would do, and that's probably where the extra
> overhead comes in (though I admit I'm not seeing it... perhaps in
> comparing the long value to the original?)

Think about what Perl has to do to perform a hash lookup.

1. Compute a hash value.  This is a calculation that goes
  character by character, and hence takes an amount of
  time that is proportional to the length of the key.

2. Figure out which bucket that hash value would go into.
  This is a fixed numerical calculation.

3. Walk through the linked list that the bucket points to,
  checking whether each entry is the right one.  As an
  optimization, Perl first compares the stored hash value
  and the string lengths, and only then compares the
  strings themselves for equality.  Walking the list is
  (on average, if the hashing algorithm works well) a
  fixed-time operation, but the final equality test takes
  time proportional to the length of the key.

So you see that (if the hashing algorithm does a good job of
distributing keys to buckets), the time to access a hash
element is independent of the number of things in the hash
but proportional to the length of the hash key.
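
You can see the key-length term directly with a benchmark along the
lines of yours -- same hash, same number of lookups, only the key
length changes (a sketch; the absolute numbers will vary by machine):

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

my %h;
my $short = "a" x 10;       # 10-byte key
my $long  = "a" x 10_000;   # 10_000-byte key
$h{$short} = 1;
$h{$long}  = 1;

# Both lookups do the same bucket walk; the per-character hashing
# and the final string comparison are what grow with key length.
timethese(100_000, {
    short_key => sub { my $v = $h{$short} },
    long_key  => sub { my $v = $h{$long}  },
});
```

The long-key case should come out slower by roughly the cost of
hashing and comparing the extra bytes.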

> In a case where collisions wouldn't be a real problem, I guess that's a
> non-issue, but those are rare cases.

If your hashing algorithm does not take time proportional to
the length of the thing to be hashed, then it is ignoring
possible differences in big chunks of that thing.  Cases where
you'd be willing to do that are likely few and far between.
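
To make that concrete, here is a deliberately constant-time scheme
that hashes only the first 8 bytes of the key (a toy of my own, not
anything Perl does).  Any two keys sharing those 8 bytes collide, no
matter how much they differ afterwards:

```perl
use strict;
use warnings;

# Constant-time "hash": read a fixed 8-byte prefix and ignore the rest.
sub prefix_key { unpack("%32C*", substr($_[0], 0, 8)) }

my $k1 = ("x" x 8) . ("a" x 10_000);
my $k2 = ("x" x 8) . ("b" x 10_000);
# The keys differ in 10_000 positions, but the hash never looks
# past byte 8, so this pair is guaranteed to collide.
print prefix_key($k1) == prefix_key($k2) ? "collide\n" : "distinct\n";
```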

Cheers,
Ben
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
