On Thu, 2004-12-30 at 20:45, Ben Tilly wrote:
> On Thu, 30 Dec 2004 18:02:07 -0500, Aaron Sherman <[EMAIL PROTECTED]> wrote:

> > I understand risk assessment and the idea that nothing is 100% safe, but
> > when you have a situation where you KNOW from day one that some keys
> > will collide, and your data will be corrupted, you don't build that into
> > your system if you have an easy out.
> 
> Then I recommend that you never use rsync.  As for me, I'm
> sometimes willing to accept the possibility of algorithm failures
> which are less than the odds of my program going wrong
> because of cosmic radiation.

Take a look at rsync. By default, it relies on four pieces of
information, precisely because trusting a single checksum is far too
risky for a general-purpose tool: file size, timestamp, a whole-file
checksum, and per-block checksums. Moreover, rsync lets you tune how
each and every one of those is interpreted from the command line.

And yes, while I usually just trust to the law of probability (which is
a very strange feature of our universe, if you stop to think about it),
when I'm doing something that requires certainty I do not rely on
rsync's block-for-block checksum strategy. Instead, I do one of two
things:

     1. I force a full copy of files based on timestamp (costly, of
        course)
     2. I add one byte to the end of all files, rsync normally, remove
        the byte, and rsync again. This yields odds of correctly
        identifying changes that are an order of magnitude better, but
        still not perfect. Again, this is a matter of risk assessment.
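
Strategy 2 can be sketched in miniature like so. The
`digests_match` helper is a hypothetical stand-in for rsync's checksum
comparison (a plain MD5 match here, not rsync itself); the point is that
a silent false match now requires two independent collisions, which is
why the odds improve.

```python
import hashlib

def digests_match(src: bytes, dst: bytes) -> bool:
    # Stand-in for rsync deciding "these already match, skip the copy".
    return hashlib.md5(src).digest() == hashlib.md5(dst).digest()

def two_pass_sync(src: bytes, dst: bytes) -> bytes:
    # Pass 1: append one byte to the source, sync if digests differ.
    padded = src + b"\0"
    if not digests_match(padded, dst):
        dst = padded
    # Pass 2: drop the byte and sync again.  Skipping a real change now
    # requires a collision on BOTH passes, not just one.
    if not digests_match(src, dst):
        dst = src
    return dst
```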

> > This is hashing 101. You hash, you bucket based on the hashes, and then
> > you store a list at each bucket with key and value tuple for a linear
> > search. There are other ways to do it, but this is the classic.
> 
> Yes, I'm familiar with this, and outlined it in a previous email in
> this thread.

So, why were we discussing a module that would tie a hash to a lossy
checksumming operation, when standard practice in computer science
provides a simple solution?
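
The "hashing 101" scheme quoted above, which is that simple solution,
can be sketched in a few lines (a minimal illustration, not Perl's
internal hv.c code):

```python
class ChainedHash:
    """Hash the key, pick a bucket, then linearly search that bucket's
    (key, value) pairs.  Collisions are expected and resolved by a
    full-key comparison, so a lossy hash never corrupts data."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def store(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:               # full comparison, not just the hash
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def fetch(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)
```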

> > Of course, Perl does this for you. That extra time that I measured is
> > almost certainly the time spent comparing the two strings, which your
> > tie interface will also have to do because of collisions.
> 
> Want to bet whether Perl spends more time in computing hash
> values or comparing strings?

I think you missed the point. I was not discussing a trade-off, but
rather explaining why my example played out the way it did. If you're
suggesting pre-computing hashes, not just as a front-end to fetch/store
but on a persistent basis, then you will get a very noticeable
performance win, but at the cost of having to perform some kind of
copy-on-write or other tainting check for modified keys (slowing all
other operations)... which, it turns out, Perl already does for you;
see sv.[hc] and hv.c in the standard Perl 5.8.x distribution ;-)
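
In outline, that "cache the hash, taint it on modification" idea looks
like the sketch below. This is an illustration of the concept only, not
Perl's actual hv.c logic:

```python
class CachedKey:
    """Memoize a (potentially expensive) hash and invalidate it when the
    key's contents change -- the 'tainting check for modified keys'
    mentioned above."""

    def __init__(self, text):
        self._text = text
        self._hash = None          # not computed yet

    def set(self, text):
        self._text = text
        self._hash = None          # mutation taints the cached hash

    def hash_value(self):
        if self._hash is None:     # compute once, reuse thereafter
            self._hash = hash(self._text)
        return self._hash
```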

-- 
781-324-3772
[EMAIL PROTECTED]
http://www.ajs.com/~ajs

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm