Re: [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.

2011-05-21 Thread travis+ml-rbcryptography
On Fri, May 20, 2011 at 05:18:16PM -0500, Nico Williams wrote:
  I wonder if A/V shouldn't use something similar?
 
 The rsync rolling CRC is useful for detecting insertions and deletions
 -- i.e., remote diff.

Right, but right now some anti-virus software computes hashes over the
whole file, or so I've heard, and so even a single bit flip in a
resource (e.g., an icon) can defeat some of them.

What'd be nicer is fuzzier matches --- perhaps.

 A function with
 that property isn't a hash function.

How do you figure?
-- 
http://www.subspacefield.org/~travis/
I don't believe in luck.  I believe in the law of large numbers.
If you are a spammer, please email j...@subspacefield.org to get blacklisted.




Re: [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.

2011-05-21 Thread Zooko O'Whielacronx
Dear Nico Williams:

Thanks for the reference! Very cool.

What I would most want is for ZFS (and every other filesystem) to
maintain a Merkle Tree over the file data with a good secure hash.
Whenever a change to a file is made, the filesystem can update the
Merkle Tree with mere O(log(N)) work in the size of the file plus
O(N) work in the size of the change. For a modern filesystem like ZFS,
which is already maintaining a checksum tree, the *added* cost of
maintaining the secure hash Merkle Tree could be minimal.

Then, the filesystem should make this Merkle Tree available to
applications through a simple query.
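
Concretely, something like this rough sketch is what I have in mind
(ordinary application-level Python, not ZFS code; the 4 KiB leaf size
and SHA-256 are just assumed parameters for illustration): a tree of
hashes over fixed-size file blocks, where a block write rehashes only
the one leaf-to-root path, and the root is available as a cheap query:

import hashlib

LEAF_SIZE = 4096  # assumed leaf block size

def h(data):
    return hashlib.sha256(data).digest()

class BlockMerkleTree:
    def __init__(self, blocks):
        # levels[0] holds the leaf hashes, levels[-1] holds [root]
        self.levels = [[h(b) for b in blocks] or [h(b"")]]
        while len(self.levels[-1]) > 1:
            prev = self.levels[-1]
            self.levels.append([h(b"".join(prev[i:i + 2]))
                                for i in range(0, len(prev), 2)])

    def root(self):
        # the "simple query": one small value summarizing the whole file
        return self.levels[-1][0]

    def update_block(self, index, new_block):
        # a write to one block rehashes that leaf and its O(log(N)) ancestors
        self.levels[0][index] = h(new_block)
        for lvl in range(1, len(self.levels)):
            index //= 2
            children = self.levels[lvl - 1][2 * index:2 * index + 2]
            self.levels[lvl][index] = h(b"".join(children))

(Appends and truncation would also have to adjust the tree shape; the
point is just that an in-place block write touches only one path.)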

This would enable applications—without needing any further
in-filesystem code—to perform a Merkle Tree sync, which would range
from noticeably more efficient to dramatically more efficient than
rsync or zfs send. :-)

Of course it is only more efficient because we're treating the
maintenance of the secure-hash Merkle Tree as free. There are two
senses in which this is legitimate and it is almost free:

1. Since the values are maintained persistently over the file's
lifetime, the total computation required is approximately O(N),
where N is the total size of all the deltas that have been applied to
the file in its life. (Let's just drop the logarithmic part for now;
see 2. below.)

Compare this to the cost of doing a fast, insecure CRC over the whole
file such as in rsync. The cost of that is O(N) * K where N is the
(then current) size of the file and K is the number of times you run
rsync on that file.
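
To put made-up but plausible numbers on that (this is just arithmetic,
not a measurement): take a 1 GiB file that accumulates 100 MiB of
changes over its lifetime and gets synced 1000 times.

GiB, MiB = 2**30, 2**20
lifetime_deltas = 100 * MiB    # total bytes ever rewritten
sync_runs = 1000               # K in the text
file_size = 1 * GiB            # N at sync time

merkle_bytes_hashed = lifetime_deltas        # ~O(total deltas), log factor dropped
rsync_bytes_scanned = file_size * sync_runs  # ~O(N) on every run

print(merkle_bytes_hashed / GiB)   # ~0.1 GiB hashed over the file's whole life
print(rsync_bytes_scanned / GiB)   # ~1000 GiB checksummed across the 1000 runs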

The extreme case is if the file hasn't changed. Then for the
application-level code to confirm that the file on this machine is the
same as the file on that machine, it merely has to ask the filesystem
for the root hash on each machine and transmit that root hash over the
network. This is optimally fast compared to rsync, and unlike zfs
send|recv it is optimally fast whenever the two files are identical
even if they have both changed since the last time they were synced.
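
And when the files are not identical, the same trees give you the
sync: compare roots, and descend only into subtrees whose hashes
differ. A rough sketch (assuming both sides expose trees shaped like
the BlockMerkleTree sketch above, and that the two files are the same
size so the trees have the same shape):

def differing_blocks(local, remote, level=None, index=0):
    # yield leaf indices whose hashes differ between the two trees
    if level is None:
        level = len(local.levels) - 1        # start at the root
    if local.levels[level][index] == remote.levels[level][index]:
        return                               # whole subtree identical, skip it
    if level == 0:
        yield index                          # a differing leaf block
        return
    for child in (2 * index, 2 * index + 1):
        if child < len(local.levels[level - 1]):
            yield from differing_blocks(local, remote, level - 1, child)

(In practice the "remote" tree would be queried over the network node
by node, so only the hashes along differing paths, plus the differing
blocks themselves, actually get transferred.)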

2. Since a modern, sophisticated filesystem like ZFS is maintaining
a tree of checksums over the data *anyway*, you can piggy-back this
computation onto that work, avoiding any extra seeks and minimizing
extra memory access.

In fact, ZFS itself can actually use SHA-256 for the checksum tree,
which would make it provide almost exactly what I want, except for:

2. a. From what I've read, nobody uses the SHA-256 configuration in
ZFS because it is too computationally expensive, so they use an
insecure checksum (fletcher2/4) instead.

2. b. I assume the shape of the resulting checksum tree is modified by
artifacts of the ZFS layout instead of being a simple canonical shape.
This is a show-stopper for this use case because if the same file data
exists on a different system, and some software on that system
computes a Merkle Tree over the data, it might come out with different
hashes than the ZFS checksum tree, thus eliminating all of the
performance benefits of this approach.

But, if ZFS could be modified to fix these problems or if a new
filesystem would add a feature of maintaining a canonical,
reproducible Merkle Tree, then it might be extremely useful.

Thanks to Brian Warner and Dan Shoutis for discussions about this idea.

Regards,

Zooko


Re: [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.

2011-05-21 Thread Nico Williams
On Sat, May 21, 2011 at 2:53 AM,
travis+ml-rbcryptogra...@subspacefield.org wrote:
 On Fri, May 20, 2011 at 05:18:16PM -0500, Nico Williams wrote:
 A function with
 that property isn't a hash function.

 How do you figure?

Well, to be fair, a rolling hash is a hash function proper.  It may
well not be what we'd call a cryptographically secure function -- I'll
admit I'm not certain of that; it is just my intuition that a rolling
hash is not cryptographically secure.  I find very little research on
cryptographically secure rolling hash functions (for laughs, though,
search for secure rolling hash), but even so, I'm having second
thoughts about my statement above.

Nico
--


Re: [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.

2011-05-21 Thread Nico Williams
On Sat, May 21, 2011 at 1:50 PM, Zooko O'Whielacronx zo...@zooko.com wrote:
 What I would most want is for ZFS (and every other filesystem) to
 maintain a Merkle Tree over the file data with a good secure hash.

Me too.  ZFS does do that, but unfortunately the internal Merkle hash
maintained this way also has metadata tree nodes (namely, indirect
blocks), which wouldn't be a problem except that embedded in that
metadata (and hashed in) are block pointers, which contain some data
that is completely non-deterministic relative to the file contents.
The right thing to have done would have been to exclude the address
portion of the block pointers, but that would have required more data
movement and/or computation (and perhaps it was not thought of at the
time, but I was never on the ZFS team, so I wouldn't know).

 Whenever a change to a file is made, the filesystem can update the
 Merkle Tree with mere O(log(N)) work in the size of the file plus
 O(N) work in the size of the change. For a modern filesystem like ZFS
 which is already maintaining a checksum tree the *added* cost of
 maintaining the secure hash Merkle Tree could be minimal.

Right!

 Then, the filesystem should make this Merkle Tree available to
 applications through a simple query.

Indeed.  And if non-deterministic metadata is excluded then the only
thing one would have to know in order to independently compute the
same hash would be the maximum breadth of the tree and the leaf block
size, and that the tree is intended to be kept with all but the
rightmost leaf node full.  This would be a wonderful feature to have.
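
That is, anyone could recompute it from the raw bytes alone, something
like this sketch (the hash, 4 KiB leaf size, and fanout of 2 are all
just assumed parameters -- the point is that nothing
filesystem-specific goes into the result):

import hashlib

def canonical_root(data, leaf_size=4096, fanout=2):
    h = lambda b: hashlib.sha256(b).digest()
    # all leaves full except possibly the rightmost one
    nodes = [h(data[i:i + leaf_size])
             for i in range(0, len(data), leaf_size)] or [h(b"")]
    while len(nodes) > 1:
        nodes = [h(b"".join(nodes[i:i + fanout]))
                 for i in range(0, len(nodes), fanout)]
    return nodes[0]

Two systems that agree on leaf_size and fanout get the same root for
the same bytes, regardless of how either one lays the file out on disk.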

 This would enable applications—without needing any further
 in-filesystem code—to perform a Merkle Tree sync, which would range
 from noticeably more efficient to dramatically more efficient than
 rsync or zfs send. :-)

I haven't thought about this enough, but my level-0 concern would be
that data moves not aligned with block size would require a rolling
hash in order to make synchronization efficient, thus the Merkle tree
wouldn't help unless, perhaps, it were constructed from rolling hash
functions.

The other thought I had was that now that we accept that file stores
need large amounts of computational power, not just I/O bandwidth, it's
not a stretch for the server to also compute and store (i.e.,
pre-compute) arrays of rolling CRCs, each taken over a small data
chunk contiguous with the preceding and succeeding ones.  This would
allow a client to very quickly read the rsync signature of a remote
file and locally generate a patch to it, making the write process very
efficient for large files that get many small changes all over the
place.
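
Roughly this, precomputed and stored server-side (the chunk size and
the adler32/sha256 weak/strong pair are just illustrative choices, not
a claim about what ZFS or rsync would actually use):

import hashlib
import zlib

CHUNK = 8192  # assumed chunk size

def precomputed_signature(data):
    # one (weak checksum, strong hash) pair per fixed-size chunk
    sig = []
    for off in range(0, len(data), CHUNK):
        chunk = data[off:off + CHUNK]
        sig.append((zlib.adler32(chunk),                  # weak, cheap, rollable
                    hashlib.sha256(chunk).digest()))      # strong, confirms a match
    return sig

The client fetches that small array instead of the file, slides a
window over its local copy with a rolling version of the weak
checksum, and only writes back the ranges that find no matching chunk.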

 Of course it is only more efficient because we're treating the
 maintenance of the secure-hash Merkle Tree as free. There are two
 senses in which this is legitimate and it is almost free:

And/or because we intend to use this for optimizing an otherwise slow
operation, and if we do it enough then it'd be worthwhile even if the
given FS had never before used a Merkle tree.
 [...]
 2. a. From what I've read, nobody uses the SHA-256 configuration in
 ZFS because it is too computationally expensive, so they use an
 insecure checksum (fletcher2/4) instead.

You kinda have to use SHA-256 if you want dedup.

 2. b. I assume the shape of the resulting checksum tree is modified by
 artifacts of the ZFS layout instead of being a simple canonical shape.

Yes: a) the shape of the indirect block tree (but this is
deterministic if you know the maximum block size for the filesystem
and the maximum block size for the file), and b) physical block
address information in indirect blocks (which could and should have been
excluded -- it's good enough to compute the hash of an indirect block
over the block checksums in each block pointer).  See above.  Oh,
also, there are the first few block pointers in the dnode too.

 This is a show-stopper for this use case because if the same file data
 exists on a different system, and some software on that system
 computes a Merkle Tree over the data, it might come out with different
 hashes than the ZFS checksum tree, thus eliminating all of the
 performance benefits of this approach.

Indeed, but:

 But, if ZFS could be modified to fix these problems or if a new
 filesystem would add a feature of maintaining a canonical,
 reproducible Merkle Tree, then it might be extremely useful.

I think it could be.  I don't think you'll get much out of the ZFS
folks nowadays though.  But if you talk to ZFS folks outside of
Oracle, they might confirm whether this is doable.

BTW, IIRC I filed an RFE at Sun to expose a Merkle hash tree checksum
to applications.  I believe I've seen others ask for this before too
at various times, and it may be that the RFE I filed (if I did) was in
response to a request on one of the OSOL discuss lists -- this must
have been back in 2005.  This is a really, really obvious enhancement
to want -- it should be obvious the moment you realize that Merkle
hash trees [...]

Re: [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.

2011-05-20 Thread Zooko O'Whielacronx
On Fri, May 20, 2011 at 3:30 PM,
travis+ml-rbcryptogra...@subspacefield.org wrote:

 I wonder if A/V shouldn't use something similar?

What's A/V?

 I assume MD4 is an outdated choice - perhaps some cryppie needs to
 design a hash function that is specifically designed for a FIFO kind
 of window?  Maybe there is and I'm just out of the loop.

 Potentially another application is for metadata silvering on file
 systems like ZFS, where we want to keep an updated checksum for a
 file, to detect corruption, but still want to have, say, efficient
 writing to the file - can you support appending?  How about random access?

 Also, FEC defends against an unintelligent adversary; I wonder if we
 couldn't defend against stronger ones (MAC/MIC) efficiently and
 neutralize the unintelligent one (nature and errors) for free?  It
 seems a shame to tack two sets of metadata onto our data.

All of the above seems well suited to maintaining a Merkle Tree over
the file data with a secure hash.

Regards,

Zooko


Re: [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.

2011-05-20 Thread Nico Williams
On Fri, May 20, 2011 at 4:30 PM,
travis+ml-rbcryptogra...@subspacefield.org wrote:
 Just a quick thought, I noticed the other day that rsync uses a
 rolling MD4 hash or something like that to detect changes in a
 window of data.

A quick look around should tell you that it uses a rolling checksum
and a hash function.  MD4 is one such hash function.  The rolling
checksum is a CRC, which is to say, not a hash function.

 I wonder if A/V shouldn't use something similar?

The rsync rolling CRC is useful for detecting insertions and deletions
-- i.e., remote diff.

 I assume MD4 is an outdated choice - perhaps some cryppie needs to
 design a hash function that is specifically designed for a FIFO kind
 of window?  Maybe there is and I'm just out of the loop.

MD4 isn't the function with the rolling property.  A function with
that property isn't a hash function.  It might be a strong CRC though,
which might be good enough for error detection, or not.
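
For concreteness, the rolling property looks like this (a toy version
in the spirit of rsync's weak checksum, not its actual code): two
running sums that can be slid one byte at a time in O(1), rather than
being recomputed over the whole window:

M = 1 << 16

def weak(block):
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, n):
    # drop the byte leaving the window, add the byte entering it
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b

data = bytes(range(256)) * 4
n = 64
a, b = weak(data[0:n])
a, b = roll(a, b, data[0], data[n], n)
assert (a, b) == weak(data[1:n + 1])  # O(1) slide equals full recompute

That slide is what makes scanning every window offset of a large file
affordable; it is also why nobody should expect such a function to be
collision-resistant.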

 Potentially another application is for metadata silvering on file
 systems like ZFS, where we want to keep an updated checksum for a
 file, to detect corruption, but still want to have, say, efficient
 writing to the file - can you support appending?  How about random access?

That would be nice, but I don't think a CRC offers strong enough
protection.  What might be nice is if the filesystem could export an
API for getting rolling CRC data -- it might speed up rsync-like
applications.  I filed an RFE for that in ZFS back at Sun (Oracle)
years ago, and I posted about it back in 2005 in this thread:

https://www.opensolaris.org/jive/thread.jspa?threadID=12917

Nico
--