zooko
Fri, 21 Mar 2008 12:07:15 -0700
Dear Perry Metzger:Jim McCoy asked me to forward this, as he is not subscribed to cryptography@metzdowd.com, so his posting bounced.
Regards, Zooko Begin forwarded message:
From: Jim McCoy <[EMAIL PROTECTED]> Date: March 20, 2008 10:56:58 PM MDTTo: theory and practice of decentralized computer networks <p2p- [EMAIL PROTECTED]>Cc: [EMAIL PROTECTED], Cryptography <cryptography@metzdowd.com>Subject: Re: [tahoe-dev] [p2p-hackers] convergent encryption reconsideredReply-To: [EMAIL PROTECTED] On Mar 20, 2008, at 12:42 PM, zooko wrote:Security engineers have always appreciated that convergent encryption allows an attacker to perform a confirmation-of-a-file attack -- if the attacker already knows the full plaintext of a file, then they can check whether a given user has a copy of that file.The truth of this depends on implementation details, and is an assertion that cannot be said to cover all or even most of the potential use-cases for this technique. This property only holds if it is possible for the attacker to link a selected ciphertext/file to a user. Systems which use convergent encryption to populate a shared storage pool _might_ have this property, but is my no means a certainty; if a system is implemented correctly is is not necessary for users to expose their list of files in order to maintain this shared storage space.basically people can tell which files you are storing if they are "publically known" files, but they can't learn anything about your own personal files.It sounds like you have a design problem. If nodes that participatein the system can distinguish between publication and _re_- publication/replication (or whatever you want to call the random sharing of arbitrary data blocks for the purposes of increasing file availability) then you have a problem. If these two activities are indistinguishable then an observer knows you have some blocks to a file but should not be able to distinguish between you publishing the blocks and the act of re-distribution to increase block availability.The Learn-Partial-Information Attack [...]A better title for this would be "Chosen-Plaintext attack on Convergent Encryption", since what you are talking about is really a chosen plaintext attack. To be a bit more specific, this is really just a version of a standard dictionary attack. The solution to this problem is to look at similar systems that suffered from dictionary attacks an see what solutions were created to solve the problem. The most widely known and studied version of this is the old crypt()/ passwd problem.For another example, if you use Tahoe to backup your entire home directory, or your entire filesystem, then the attacker gains the opportunity to try to learn partial information about various files which are of predictable format but have sensitive fields in them, such as .my.cnf (MySQL configuration files), .htpasswd, .cvspass, .netrc, web browser cookie files, etc..The problem with this imagined attack are twofold. I will use your Tahoe example for my explanations because I have a passing familiarity with the architecture. The first problem is isolating the original ciphertext in the pool of storage. If a file is encrypted using convergent encryption and then run through an error-correction mechanism to generate a number of shares that make up the file an attacker first needs to be able to isolate these shares to generate the orginal ciphertext. FEC decoding speeds may be reasonably fast, but they are not without some cost. If the storage pool is sufficiently large and you are doing your job to limit the ability of an attacker to see which blocks are linked to the same FEC operation then the computational complexity of this attack is significantly higher than you suggest. Assuming an all-seeing oracle who can watch every bit sent into the storage pool will get us around this first problem, but it does raise the bar for potential attackers. The second problem an attacker now faces is deciding what sort of format a file might have, what the low-entropy content might be, and then filling in values for these unknowns. If your block size is small (and I mean "really small" in the context of the sort of systems we are talking about) there might be only a few kilobits of entropy in the first couple of blocks of a file so either a rainbow-table attack on known file formats or a dedicated effort to grab a specific file might be possible, but this is by no means certain. Increase your block size and this problem becomes much harder for the attacker.Defense Against Both Attacks [...] However, we can do better than that by creating a secret value and mixing that value into the per-file encryption key (so instead of symmetric_key = H(plaintext), you have symmetric_key = H(added_secret, plaintext), where "," denotes an unambiguous encoding of both operands). This idea is due to Brian Warner and Drew Perttula.Bad idea, for a couple of reasons. This first problem is that you are assuming the added secret is adding a significant amount of entropy and the second is that you are throwing out the advantage of convergent encryption. If the secret is shared across multiple files then I can run a known-plaintext attack on your system using a file that I assume I have in common with my target (e.g. a standard, small, OS file) to get their added_secret and then once I know my target's secret I move on to the file I really want to go after. If your userbase consists of paranoid cyperpunks then the added secret just becomes a lookup in a rainbow table, and if they are the sort of userbase that some people like to call "the consumer market" then the entropy being added here is negligible. A better solution would be to look at something like bcrypt() (see "A Future-Adaptable Password Scheme") and use this mechanism for files below a certain threshold. I do not think that the expensive key scheduling operation discounts the use of CTR mode, so you are safe on that side of things, but it would mean that you step away from AES...A Comment About Space Savings One of the original motivations for convergent encryption, as expressed in Doceur's technical report, is to conserve storage space by coalescing identical files. [...] My suspicion is that the gains available for modern uses are nowhere near that good -- I wouldn't be surprised if it were less than 5% [...]I would be very, very surprised if you were correct.My reasoning is: 1. The Doceur et al. test set was probably the ideal test set to highlight the advantages of coalescing: it was hundreds of workstations, which probably all followed the same Information Technology Department policies, which were probably homogeneous, and which were probably packed with many of the same software products (especially Microsoft products).Are you suggesting the rest of the world uses software products that are radically different than those used in a corporate environment :) We all use a small set of operating systems, that share the same basic set of files. We all use the same small set of application software and even if there is a "long tail" in application software the amount of storage this occupies compared to the operating system and most common applications is likely to be a very small fraction of the total being stored.2. Movies and music. In 2002, on Windows workstations in an office at Microsoft, there probably weren't any movies. For some of the probable use cases for Tahoe, movies and collections of music are the largest component. The presence of movie files, which are large, could increase or decrease the proportion of savings from coalescing files, depending on the degree to which those files are shared among multiple users, which brings us to the next point:This was actually a concern of mine when originally designing the system, but for a different reason. At that time the large content files we were concerned with were music. We knew that the amount of available music was a bounded total, but at that point in time the world had not converged on a standard set of codecs and standard bitrates for these codecs. We worried about the fact that the same song would end up with multiple copies in the pool because some people would use LAME, some would use WinAMP, others a pirated Fraunhofer codec, and none of them would use the same bitrate. This problem still exists somewhat, but the fact that most people use only a few packages for CD ripping, codecs have standardized, and most encoding uses only a few standard bitrates makes the job much easier and the storage savings to be realized significantly larger. When it comes to movies, the fact that a lot of them are *cough* "distributed" from a few master copies and seldom re-encoded means that you are likely to see many copies of the same bits when it comes to this category. DVD ripping is also a fairly standardized process, so you win again. Given the fact that people are not buying terabyte drives to store their scans of Proust folios I think we can make an educated guess as to what is taking up a lot of space on these drives and how much duplication there is among these bits.3. The Long Tail; Today there seems to be a more diverse set of things taking up a bigger portion of the total.For all the talk of "the long tail" it seems that we all tend to visit the same set of YouTube clips and archive a lot of the same information. More info is being generated by diverse groups, but the amount being generated still seems to be growing a lot slower than available storage capacity. One side-effect of the long-tail is that as groups become more diverse in their taste in information they still tend to share the information that is specific to their interests. You might see the storage advantages decline somewhat until you have a significant userbase, but when considered in relation to point #2 this is really a minor issue.4. Personally-produced photos and movies. It seems like some of the files that people are most eager to back up and to share are the product of their own cameras. Such files will typically be shared by only one or a few users.This is the only point I can agree upon. Personally-generated media (which is distinct from "user-generated content") is what makes individual systems unique. On a generic user's computer this breaks down into three basic categories: -Personal documents -Photos -"Home" movies The size of the first is negligible. The second is large and getting larger, and the third is probably a lot smaller than most of us think. Photos are really the only one that will skew this result. In conclusion I would have to say that the sky is not really falling. The attack outlined is a known problem that can be solved in ways that still preserve the advantages of convergent encryption. _______________________________________________ tahoe-dev mailing list [EMAIL PROTECTED] http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
--------------------------------------------------------------------- The Cryptography Mailing List Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]