[apologies if this gets double-posted, I've been having email problems today]
On Thu, 20 Mar 2008 21:56:58 -0700 Jim McCoy <[EMAIL PROTECTED]> wrote:

> > confirmation-of-a-file attack -- if the attacker already knows
> > the full plaintext of a file, then they can check whether a
> > given user has a copy of that file.
>
> The truth of this depends on implementation details, and is an
> assertion that cannot be said to cover all or even most of the
> potential use-cases for this technique.

For concreteness (and for the benefit of readers who aren't familiar with Tahoe), I'd like to throw in a few notes about how Tahoe does things.

For each immutable file that gets uploaded, the first step is to decide upon an AES encryption key. One option is simply to pick a random 128-bit number. Another is to hash the file: Tahoe 0.9.0 uses a tagged SHA-256d hash of the file. (Every distinct use of a hash function in Tahoe gets a distinct tag; in this case, the tag is the netstring of "allmydata_immutable_content_to_key_v1+", plus a string that encapsulates (k, N, segment_size), to make sure that alternate encodings of the same file result in distinct versions. See [1] for more details.)

Regardless of how the encryption key is generated, the key is hashed to create the "Storage Index", or "SI". This storage index has two purposes: to determine the set of servers on which the shares will be placed, and to index the shares on those servers. When using CHK, the SI is a second-order hash of the plaintext. When not using CHK, the SI is a hash of a random string.

The storage servers are basically just remote tables that map a Storage Index to a chunk of data. Any client with storage authority (i.e. an account of some sort) can add new entries to this table by uploading a share. Any client can request the share associated with a given index.

So when convergence is in use, any attacker who gets to see the storage index is effectively seeing a hash of the file's plaintext:
not an exact hash, but a value derived from the plaintext with all the same properties as a normal cryptographic hash function.

Who gets to see a storage index and correlate it to a given user? The SI is a parameter in all share upload and download requests, and it is present inside the "verifier-cap" and "repair-cap" values (which can be transferred to delegate verification or repair duties). If the user's client machine performs any of these requests directly, the machine at the other end will see the correlation. The links between the client and the storage servers are encrypted (since those are Foolscap connections), so an observer watching those links will not learn the SI values. However, the storage servers themselves can see everything. Clients which are uploading a file will first ask at least N servers whether they already have the share in question: those servers will learn that the client holds the given file and wants to publish it. Clients trying to download the file will ask at least k servers if they have the share: those servers will learn that the client desires the given file.

So an adversary who manages to provide or compromise more than a handful of storage servers will be able to observe the storage index and client identity for every read or write operation performed in the grid. Running just one server is enough to see a significant portion of the requests.

Our conservative design treats the storage servers as part of the threat. We expect these servers to return data that was previously uploaded, nothing more: the security properties of the system must hold even if the servers corrupt or publish the shares. We assume that everything on the storage servers is public knowledge. (Note that our *availability* properties depend upon enough of the servers returning their shares uncorrupted, but our *security* properties do not.)
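To make the key and storage-index derivation above concrete, here is a minimal sketch. The tag strings, parameter encoding, and 128-bit truncation are illustrative simplifications, not Tahoe's exact formats; see hashutil.py [1] for the real thing.

```python
import hashlib

def netstring(s: bytes) -> bytes:
    # Netstring framing: b"<decimal length>:<data>,"
    return str(len(s)).encode() + b":" + s + b","

def tagged_sha256d(tag: bytes, data: bytes) -> bytes:
    # SHA-256d: SHA-256 applied twice; the netstring-framed tag keeps
    # every distinct use of the hash function in a distinct domain.
    inner = hashlib.sha256(netstring(tag) + data).digest()
    return hashlib.sha256(inner).digest()

def convergent_key(plaintext: bytes, k: int, n: int, segsize: int) -> bytes:
    # CHK mode: the AES key is a tagged hash of the plaintext, with the
    # encoding parameters folded into the tag so that alternate encodings
    # of the same file yield distinct keys. (Tag layout is illustrative.)
    tag = b"allmydata_immutable_content_to_key_v1+" + f"{k},{n},{segsize}".encode()
    return tagged_sha256d(tag, plaintext)[:16]  # 128-bit AES key

def storage_index(key: bytes) -> bytes:
    # The SI is a hash of the key, so under CHK it is a second-order hash
    # of the plaintext; with a random key it reveals nothing about the file.
    return tagged_sha256d(b"storage_index_v1", key)[:16]  # illustrative tag
```

Note the property under discussion: two clients uploading the same file with the same (k, N, segment_size) derive the same key and therefore the same SI, which is both the deduplication win and the leak.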
Granted, in a commercially-run storage grid, clients will only be using servers that the company provides, and we do not write storage servers that corrupt or publish the shares they are storing. But in most other use cases, we expect that it will be fairly easy for a potential eavesdropper to bribe users into revealing the storage index values that they are interested in: just offer them storage space :-).

Also note that we record the exact file size inside each share, so that the ciphertext file can be reconstructed and verified (and repaired) using just the information from a single share. This reduces the guessing space considerably. For large files, the file size may be fairly unique, so it may be possible to perform correlation even without a hash of the contents.

Finally, for accounting/quota purposes the storage servers record an account number in each lease on each share. This number only needs to be correlated between servers by some system-wide accounting server, so that it can determine that Alice is using 5GB of space because she has 1GB of shares on server 1, 2GB on server 2, and 2GB on server 3 (and bill her accordingly). The account number could be hashed with the serverid and made different on each server, but this would increase the amount of work that the accounting server must do (or store), and would make certain delegations more difficult (like the Upload Helper).

> This property only holds if it is possible for the attacker to link a
> selected ciphertext/file to a user. Systems which use convergent encryption
> to populate a shared storage pool _might_ have this property, but it is by no
> means a certainty; if a system is implemented correctly it is not necessary
> for users to expose their list of files in order to maintain this shared
> storage space.

At the PET Workshop last year, I spent some time brainstorming about what we would do if we were trying to offer anonymity (which we aren't).
I believe that turning off convergence is a first requirement. Hashing the SI with the serverid to create a per-server index would make it more difficult for a server to correlate one of their shares against shares on a different machine. Adding padding, or rounding the filesize up to some convenient boundary (Zooko suggested the segment size, probably 128KiB for us), would make filesize less of a distinguisher.

But these have their costs. Per-server index values would make repair more difficult: servers wouldn't have enough information to perform repair on their own, so the original client would need to be involved in the repair process. Padding would consume more storage space. For Tahoe, we value reliability and simplicity over anonymity, so we haven't pursued these approaches.

> To be a bit more specific, this is really just a version of a standard
> dictionary attack. The solution to this problem is to look at similar
> systems that suffered from dictionary attacks and see what solutions were
> created to solve the problem.

Yeah, exactly. Most of the solutions I'm aware of add a random salt, and thus break convergence.

> The problems with this imagined attack are twofold...

Unfortunately the way we're using Storage Index values in Tahoe exposes a lot more information to the attacker than the mnet design did (and in this case we're imagining the storage servers to be the attacker). Rather than storing FEC-output blocks according to their hash, Tahoe stores shares according to their SI. So the correlation job is much easier: a pair of hashes at worst.

> Assuming an all-seeing oracle who can watch every bit sent into the
> storage pool will get us around this first problem, but it does raise
> the bar for potential attackers.

True. In Tahoe, someone who isn't running a storage server (or a Helper or a Repairer) won't be able to see the storage index, so they'll have a lot less information to work with.
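For reference, the two anonymity tweaks just described (per-server indexes and size rounding) can be sketched as follows. The helper names and the plain SHA-256 combination are hypothetical, not anything Tahoe implements:

```python
import hashlib

SEGMENT_SIZE = 128 * 1024  # 128 KiB, the suggested rounding boundary

def per_server_index(si: bytes, serverid: bytes) -> bytes:
    # Hypothetical per-server index: hashing the SI with the serverid
    # means two servers can no longer compare their tables to discover
    # that they hold shares of the same file.
    return hashlib.sha256(serverid + si).digest()[:16]

def padded_size(filesize: int) -> int:
    # Round the filesize up to the next segment boundary (empty files
    # still occupy one segment) so exact size stops being a distinguisher.
    segments = (filesize + SEGMENT_SIZE - 1) // SEGMENT_SIZE
    return max(segments, 1) * SEGMENT_SIZE
```

The costs show up exactly as described: with per-server indexes, a server alone cannot locate sibling shares for repair, and padding wastes up to one segment per file.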
But in a distributed system, you don't want your security properties to depend upon the attacker staying outside the set of storage operators.

> > Defense Against Both Attacks
> >
> > [...]
>
> Bad idea, for a couple of reasons. The first problem is that you are
> assuming the added secret is adding a significant amount of entropy,
> and the second is that you are throwing out the advantage of
> convergent encryption.

Towards the first problem, the secret that we hash in is required to be a random string at least as long as the AES key. So by definition it adds enough entropy to bring the unguessability back up to the keysize limit. Towards the second problem, yeah, that's the tradeoff :). Per-client or per-account convergence retains some of the benefits without exposing low-entropy files to attack by folks who don't know your mixin secret.

> If the secret is shared across multiple files then I can run a
> known-plaintext attack on your system using a file that I assume I have in
> common with my target (e.g. a standard, small, OS file) to get their
> added_secret, and then once I know my target's secret I move on to the file
> I really want to go after.

I think this is only true if the secret is a user-generated password. Since we plan to use a long machine-generated random string, I believe the known-plaintext attack becomes as hard as guessing that string, and we can make that as hard as we want. This secret would be generated by the client the first time it runs, and kept in a private file. To share it between accounts, some out-of-band mechanism is needed to make sure all clients get the same value, but by no means will it be created by a human's low-entropy brain :-).

> A better solution would be to look at something like bcrypt() (see "A
> Future-Adaptable Password Scheme") and use this mechanism for files
> below a certain threshold.

Interesting... I'll read up on that.

> -Personal documents
> -Photos
> -"Home" movies
>
> The size of the first is negligible.
> The second is large and getting
> larger, and the third is probably a lot smaller than most of us
> think. Photos are really the only one that will skew this result.

True, although I'm more concerned about the small files these days, since I started thinking about the per-file storage overhead (Tahoe stores a lot of hashes along with the share data, and ext3 minimum block sizes are 4KiB on large disks). I hope to include this overhead information in the analysis we're doing with the allmydata userset, to get a feel for how negligible the small files really are.

There are a lot of factors that I can imagine influencing the value of convergence. Home movies are likely to be less-replicated. Downloaded media is likely to be more-replicated between groups of friends (who influence each other's tastes) than between unrelated people: if use of Tahoe spreads from one friend to another, we might see more replication than if it were promoted by more traditional broadcast advertisement. There may be smaller files that are so heavily replicated that the overhead on them would make them worth converging.

The problem that concerned us is that the guessability/entropy/sensitivity of a file is not entirely correlated with its size. Large files are probably media, media is probably not sensitive, and media is probably shared. But we can imagine files which are large, sensitive, and low-entropy, all at the same time. I'd be a bit nervous about deploying a heuristic that retained a chance of exposing certain user files to guessing attacks. Once we've done some analysis and made an estimate of the benefits of convergence, we'll be in a better position to make a decision about the tradeoffs.
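As a back-of-the-envelope illustration of why per-file overhead makes small files less interesting to converge: the per-share overhead constant below is a made-up placeholder, not a measured Tahoe number, and the 3-of-10 encoding is just the common default.

```python
EXT3_BLOCK = 4096          # minimum allocation unit on large ext3 disks
PER_SHARE_OVERHEAD = 1024  # hypothetical bytes of hashes/metadata per share

def stored_bytes(filesize: int, k: int = 3, n: int = 10) -> int:
    # Each of the N shares holds roughly filesize/k of erasure-coded data
    # plus its hash overhead, rounded up to a whole filesystem block.
    share_data = -(-filesize // k)          # ceiling division
    share_size = share_data + PER_SHARE_OVERHEAD
    blocks = -(-share_size // EXT3_BLOCK)
    return n * blocks * EXT3_BLOCK

def expansion(filesize: int) -> float:
    # Ratio of disk consumed to bytes uploaded.
    return stored_bytes(filesize) / filesize

# Under these assumptions a 1KB file expands ~40x (fixed costs dominate),
# while a 10MB file stays near the nominal N/k = 3.3x, so deduplicating
# small files recovers proportionally much less space than it appears to.
```

This is the shape of the analysis: convergence can only recover the share-data portion, never the fixed per-share costs, so the savings from converging tiny files may indeed be negligible.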
I suspect that we'll wind up with a threshold that says we use cross-client convergence for any file that's bigger than 2MB, and only per-client convergence below that, and then try to make it clear to users what the concern is, so they can click the "make it more secure" button for their large+guessable+sensitive files. But I want to know what the space savings are (and who really benefits from them) before I make that decision.

thanks,
 -Brian

[1]: http://allmydata.org/trac/tahoe/browser/src/allmydata/util/hashutil.py

_______________________________________________
p2p-hackers mailing list
[email protected]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
