Here's what I've been using for downloading my email. It lets me download 3 megabytes of email, or about 500 messages, in about two minutes, compressed down to a little over 1 megabyte; and it lets me do that without giving my password to an insecure machine in an internet access center. No encryption yet, though.
#!/usr/bin/python """Synchronize files in store-and-forward fashion. My laptop currently may not have a working network connection, and in Ecuador we often couldn't use any direct internet connections anyway --- 'ciberes' or 'cafenetes' were everywhere, but few of them would let us hook up our laptops to their Ethernet, and fewer of them had wifi. So often I would send email by putting a batch on an SD card, going to the cafenet, and uploading it; until now, I didn't have a similar approach for downloading. Rsync ----- My usual mail-management scheme is to rsync my inbox, or at least the tail end of it, onto my laptop. But rsync doesn't run over store-and-forward. rsync has this very ingenious scheme which avoids copying over any big chunks of data you already have. My understanding is that it works as follows: - the receiver takes the CRC-32 of each disjoint block of a certain fixed size in their copy, and sends them to the sender (so that, for example, if the block is 4kiB and the file is 1MiB, they send 256 CRCs totaling 1kiB); - the sender takes the CRC-32 of each block of that size in their file, but not just disjoint blocks --- so that, for example, if their copy of the file is 1MiB and the block size is 4kiB, they take 1048576 - 4096 + 1 CRC-32s, and remember which ones represent blocks the receiver already has (and which blocks); - the sender sends instructions on how to create the correct file by copying blocks from the receiver's version of the file and inserting new data where the receiver doesn't already have it. Doing the second step in a brute-force way would be fairly slow --- in this case, it would involve more than 4 billion CRC operations. However, given the CRC-32 of 'x' + stuff (where '+' is string concatenation), it's straightforward and efficient to compute the CRC-32 not only of 'x' + stuff + 'y', but also of stuff + 'y'. (Rabin-Karp string search works similarly.) 
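That sliding-window property can be illustrated with a Rabin-Karp-style polynomial hash instead of a CRC; the constants and function names here are illustrative, not anything rsync actually uses:

```python
# Illustration of the rolling-hash idea with a Rabin-Karp-style
# polynomial hash; rsync's rolling CRC exploits the same property.
# BASE and MOD are arbitrary illustrative constants.
BASE, MOD = 257, (1 << 61) - 1

def hash_block(block):
    # O(len(block)): ordinary polynomial hash of a whole window.
    h = 0
    for byte in block:
        h = (h * BASE + byte) % MOD
    return h

def roll(h, outgoing, incoming, power):
    # O(1): given the hash of 'x' + stuff, get the hash of
    # stuff + 'y' by dropping 'x' (outgoing) and appending 'y'
    # (incoming).  'power' is BASE**(windowsize - 1), precomputed.
    return ((h - outgoing * power) * BASE + incoming) % MOD

data, k = b'some mailbox bytes', 4
power = pow(BASE, k - 1, MOD)
h = hash_block(data[:k])
for i in range(1, len(data) - k + 1):
    h = roll(h, data[i - 1], data[i - 1 + k], power)
    assert h == hash_block(data[i:i + k])   # each slide costs O(1)
```

So the sender hashes every window of the file at a cost of O(1) per window instead of O(blocksize).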
Unfortunately, I don't have a CRC-computing library handy, and I don't
understand CRC well enough to implement it myself.

Additionally, I have another issue.  I'm rsyncing my email, so other
people have control over the data I'm rsyncing; with a naive
implementation of the above algorithm, they could send me a chunk of
data with the same CRC as some later email they expect I will get, in
order to prevent me from getting the later email.  This is a problem
because it's easy to construct data with a particular CRC.  (In a less
naive implementation, the receiver would send a secure hash of each
block as well, which the sender would check before trusting the CRC.)

There's an additional problem that both the sender and the receiver
have to read the entire file before any of it gets transferred, which
is a problem when I'm being charged by the minute in a cafenet.

This program
------------

So I'm implementing something similar, but simpler, with the following
differences:

- it's store-and-forward;

- both the sender and the receiver work with only disjoint blocks of
  the file;

- they use (the first 128 bits of) the SHA-1 sum rather than CRC;

- I'm thinking about implementing an 'append-only' mode in which the
  sender assumes that every block except the last is correct, without
  checking.  But as long as it only takes 45 seconds to hash my whole
  mailbox, I'm not gonna bother.

- The block size is chosen so that the 'waste' data transmitted by the
  receiver is about the same size as the 'waste' data from the final
  block redundantly transmitted by the sender, for files of about the
  size of my mailbox, 1GiB.  They are inversely proportional.  The
  expected waste for the sender is half a block, compressed by a fixed
  factor of about 3, i.e. blocksize/6; the waste sent by the receiver
  is 16*filesize/blocksize.  16*filesize/blocksize = blocksize/6 when
  6*16*filesize = blocksize^2, which is close to 256KiB (2^18 bytes)
  for a filesize of 2^30 bytes.
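This arithmetic can be sanity-checked directly; the constants below just restate the assumptions above (16-byte hashes, sender waste of half a block compressed about 3x, a mailbox of about 1 GiB):

```python
import math

# Assumptions restated from the text: 16-byte hashes, sender waste of
# half a block compressed ~3x (so blocksize/6), a 2**30-byte mailbox.
hashsize, filesize = 16, 2**30

# receiver waste: hashsize * filesize / blocksize
# sender waste:   blocksize / 6
# These are equal when blocksize**2 == 6 * hashsize * filesize.
blocksize = math.sqrt(6 * hashsize * filesize)
# blocksize comes out around 320 KiB; the nearest power of two
# is 2**18 = 256 KiB, the default the code below actually uses.
assert 2**18 < blocksize < 2**19
assert round(math.log2(blocksize)) == 18
```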
In its current form, it takes about 45 seconds on my laptop to read my
862MiB mailbox, yielding a 55194-byte 'hashes' file.

File Format
-----------

This program uses two kinds of files: hash files and updates files.

A hash file begins with hashfile_magic_number (see the code), a
newline-terminated line containing the blocksize in ASCII, and a
newline-terminated line containing the original file size in ASCII.
The next line is a one-time password, which may be used by a server to
decide whether or not to answer a request.  The rest of the file is
binary hash data, 16 bytes per hash, all mushed together.

An updates file is gzipped, but when gunzipped is plain ASCII (insofar
as the original file whose data is contained therein is plain ASCII,
anyway).  The gunzipped data begins with updatefile_magic_number (see
the code) and a newline-terminated line containing the blocksize in
ASCII.  Then follow two or more 'chunks'; each chunk begins with a
magic number, one of block_magic, done_magic, or trunc_magic (see the
code).

Chunks that begin with block_magic follow it with the block number (as
a newline-terminated ASCII number), the size of that block (usually
the file blocksize, but the last block in the file might be shorter)
(as a newline-terminated ASCII number), and the data of the block.
You'll need to count bytes to find the end of the data of the block.

A chunk that begins with done_magic has nothing else in it, and
terminates the file.

A chunk that begins with trunc_magic follows it with the size of the
up-to-date file (which the receiver's copy is truncated to), as a
newline-terminated ASCII number.

TODO
----

D Add one-time passwords (optionally include in hashes file?)
D add --passwords option to --hash
D change hash file format to have a password line
D add --passwords option to --send (and cgi interface)
D Restrict upload of arbitrarily large files (maybe with stdin proxy?)
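As a concrete illustration of that layout (block number, data, and sizes here are made up; the magic strings are the real ones from the code), the gunzipped bytes of an updates file that replaces block 3 with four bytes and then truncates to 1000000 bytes would be:

```python
# Hypothetical gunzipped updates-file contents, following the format
# described above.  The block data 'abcd' and the sizes are invented
# for the example; the magic strings match the code.
updatefile_magic_number = b'syncfiles updatefile\n'
block_magic = b'\nblock\n'
trunc_magic = b'\ntrunc\n'
done_magic = b'\ndone!\n'

update = (updatefile_magic_number
          + b'262144\n'     # blocksize, newline-terminated ASCII
          + block_magic
          + b'3\n'          # block number
          + b'4\n'          # size of this block's data
          + b'abcd'         # block data; count bytes to find its end
          + trunc_magic
          + b'1000000\n'    # size to truncate the receiver's copy to
          + done_magic)     # terminates the file
```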
""" import sha, sys, gzip, cgi, cgitb, os, StringIO, itertools ### Magic numbers # Generally changing these will lead to incompatibilities; they are # used (some implicitly) in the file formats. default_block_size = 2**18 hashsize = 16 hashfile_magic_number = 'syncfiles hashfile\n' updatefile_magic_number = 'syncfiles updatefile\n' block_magic = '\nblock\n' done_magic = '\ndone!\n' trunc_magic = '\ntrunc\n' ### Exceptions class ExtraDataInUpdateFile(Exception): """The updates file should end after its 'done!' token, but one didn't.""" class MismatchedBlockSize(Exception): """Receiver is using the wrong block size.""" class MismatchedMagicNumber(Exception): """A file has the wrong magic number and is therefore the wrong kind of file.""" class TruncatedHash(Exception): """A hash file ended in the middle of a hash value and is therefore probably incomplete.""" class UnexpectedEndOfUpdatesFile(Exception): """The updates file didn't have a 'done!' token and is therefore probably incomplete.""" class BlockSizeUnacceptablySmall(Exception): """The block size of the uploaded file is too small; it would lead to unacceptable memory consumption.""" class TooMuchDataUploaded(Exception): """A SizeLimitingStdinProxy got too much data. Probably somebody tried to upload too much data to a CGI script.""" class InvalidPassword(Exception): """Somebody tried to authenticate a hash file with a password that didn't pass.""" ### Password file handling class NullPasswordRepository: """Password repository that validates any password and provides a dummy password when needed.""" def password_is_valid(self, password): return True def get_password(self): return 'no password repository given' class AtomicallyUpdatableFile: """Atomic updates for a text file. Only readable by __iter__. The process calling this must have permission to read and delete the file, although not necessarily to write it, and to create files in the same directory. Depends on Unix atomic rename(). 
""" def __init__(self, filename): self.filename = filename def __iter__(self): """Iterate over lines of current contents of file.""" return iter(file(self.filename)) def update(self, iterable): """Put results of iterable into new version of file, atomically. You can still have lost-updates problems with this if you have multiple concurrent updaters, which (XXX!) might potentially be a security problem allowing password-guessing. """ nfn = self.filename + '.new' out = file(nfn, 'w') try: for item in iterable: out.write(item) out.close() os.rename(nfn, self.filename) finally: if os.path.exists(nfn): os.remove(nfn) class FilePasswordRepository: """Password repository that uses a text file, one password per line.""" def __init__(self, filename): self.file = AtomicallyUpdatableFile(filename) def __repr__(self): return 'FilePasswordRepository(%r)' % self.file.filename def passwords(self): """Iterate over passwords from the file.""" for line in self.file: yield line.rstrip('\n') def password_is_valid(self, password): """If 'password' is in the file, remove it and return True. Otherwise, return false. """ for possible_password in self.passwords(): if password == possible_password: self.eliminate_password(password) return True return False def get_password(self): """Choose a password from the file (removing it).""" rv = self.passwords().next() self.eliminate_password(rv) return rv def eliminate_password(self, password): """Internal: remove a password from the file.""" self.file.update(line for line in self.file if line.rstrip('\n') != password) ### Hashfile and update file handling class NumberPacker: """Serializes and deserializes numbers. The interface is pretty much the same as that of the 'Struct' object I wrote for mail indexing, but the format is newline-delimited ASCII. 
""" def pack(self, num): """Return an ASCII representation of a number.""" return '%d\n' % num def read(self, fo): """Read a number from a file object, returning it in a 1-tuple.""" line = fo.readline() return self.unpack(line) def unpack(self, data): """Deserialize a number from its ASCII representation. Returns it in a 1-tuple (which is kind of silly, but is for compatibility with Struct.) """ return (int(data),) def write(self, fo, num): """Write a number to a file object.""" fo.write(self.pack(num)) number = NumberPacker() def hashes(fileobj, blocksize=default_block_size): """Iterate over the SHA-1 hashes of the blocks of a file.""" while 1: data = fileobj.read(blocksize) if not data: break yield sha.sha(data).digest()[:hashsize] def make_hashfile(fileobj, fileout, passwords, blocksize=default_block_size): """Write out a hashfile in the format expected by this program.""" fileout.write(hashfile_magic_number + number.pack(blocksize)) fileobj.seek(0, 2) # to EOF fileout.write(number.pack(fileobj.tell())) # length of file password = passwords.get_password() assert '\n' not in password fileout.write(password + '\n') fileobj.seek(0) for hash in hashes(fileobj, blocksize): fileout.write(hash) def _read_hashfile(infile): """Iterate over the hashes read from the body of a hashfile.""" while 1: next_hash = infile.read(hashsize) if not next_hash: return if len(next_hash) != hashsize: raise TruncatedHash(repr(next_hash)) yield next_hash def read_hashfile(infile): """Decode a hashfile, given an open file object on it. Returns a tuple of blocksize, original file size, password, and an iterator over the hashes. 
""" magic_number = infile.read(len(hashfile_magic_number)) if magic_number != hashfile_magic_number: raise MismatchedMagicNumber(hashfile_magic_number, magic_number) blocksize = number.read(infile)[0] input_file_size = number.read(infile)[0] password = infile.readline().rstrip('\n') return blocksize, input_file_size, password, _read_hashfile(infile) def make_update_file(mbox, ourhashes, theirhashes, passwords, outfile): """Make an uncompressed update file for bringing somebody up to date. mbox: the 'master' version of the file to bring them up to date with ourhashes: an up-to-date hash file of the 'master' version theirhashes: a hash file describing what they already have. Must have same block size as 'ourhashes'. passwords: a password repository to authenticate the hashfile with (use NullPasswordRepository to not authenticate) outfile: a file object to write the updates file to. """ ourblocksize, our_file_size, _, ourhashes = read_hashfile(ourhashes) (theirblocksize, their_file_size, password, theirhashes) = read_hashfile(theirhashes) if not passwords.password_is_valid(password): raise InvalidPassword(password, passwords) if ourblocksize != theirblocksize: raise MismatchedBlockSize(ourblocksize, theirblocksize) outfile.write(updatefile_magic_number + number.pack(ourblocksize)) for ourhash, blocknum in itertools.izip(ourhashes, itertools.count()): try: theirhash = theirhashes.next() except StopIteration: theirhash = '' if ourhash != theirhash: outfile.write(block_magic) number.write(outfile, blocknum) mbox.seek(blocknum * ourblocksize) data = mbox.read(ourblocksize) number.write(outfile, len(data)) outfile.write(data) outfile.write(trunc_magic) number.write(outfile, our_file_size) outfile.write(done_magic) outfile.close() def make_compressed_update_file(mbox, ourhashes, theirhashes, passwords, outfile): """Make a compressed update file with GzipFile. See make_update_file for meanings of arguments. 
""" make_update_file(mbox, ourhashes, theirhashes, passwords, gzip.GzipFile('', 'w', 9, outfile)) outfile.close() def apply_update_file(mbox, updatefile): """Given an uncompressed update file, update our version of the file. So we have a file produced by make_update_file and the file from which the 'theirhashes' file was produced; this applies the updates specified in 'updatefile' in order to make our 'mbox' exactly like their 'mbox'. """ magic = updatefile.read(len(updatefile_magic_number)) if magic != updatefile_magic_number: raise MismatchedMagicNumber(updatefile_magic_number, magic) blocksize = number.read(updatefile)[0] assert len(block_magic) == len(done_magic) == len(trunc_magic) while 1: magic = updatefile.read(len(block_magic)) if magic == block_magic: blocknum = number.read(updatefile)[0] mbox.seek(blocknum * blocksize) this_block_size = number.read(updatefile)[0] mbox.write(updatefile.read(this_block_size)) elif magic == done_magic: extra = updatefile.read(80) if extra: raise ExtraDataInUpdateFile(repr(extra)) break elif magic == trunc_magic: file_length = number.read(updatefile)[0] mbox.truncate(file_length) elif magic == '': raise UnexpectedEndOfUpdatesFile() def apply_compressed_update_file(mbox, updatefile): """Same as apply_update_file, but updatefile is assumed to be gzipped.""" apply_update_file(mbox, gzip.GzipFile('', 'r', 9, updatefile)) ### CGI interface class SizeLimitingStdinProxy: """Support cgi.FieldStorage's use of stdin while limiting upload size. Only supports the very minimal part of the file protocol cgi.FieldStorage currently uses on self.fp: .read(size) (no default) and .readline(). When using cgitb, this seems to result in deadlock when run in CGI under (at least) Debian Apache 1.3.33-4: Apache is trying to write uploaded data to the CGI program, but the CGI program is trying to write an error message to Apache, and both pipes are full. After a few minutes, Apache's child has a SIGALRM and kills the CGI program. 
    But deadlock is an OK result --- it prevents attackers from
    filling up the disk!  It does make it easy for them to deny
    service to the web server, at least for a few minutes.

    """
    def __init__(self, fp=sys.stdin, maxbytes=2*1024*1024):
        """fp: the file to limit the size on.
        maxbytes: the maximum number of bytes to allow to be uploaded.
        """
        self.fp = fp
        self.maxbytes = self.remaining_bytes = maxbytes

    def read(self, size):
        """Proxy .read of underlying filehandle while limiting bytes."""
        rv = self.fp.read(size)
        self.use_up(len(rv))
        return rv

    def readline(self):
        """Proxy .readline of underlying filehandle while limiting bytes."""
        rv = self.fp.readline()
        self.use_up(len(rv))
        return rv

    def use_up(self, amount_of_data):
        """Account for amount_of_data bytes being uploaded, possibly raising.

        If the total number of bytes used up so far is too great,
        raises TooMuchDataUploaded.
        """
        self.remaining_bytes -= amount_of_data
        if self.remaining_bytes < 0:
            raise TooMuchDataUploaded(self.maxbytes)

def run_as_cgi(mbox_fname, title, password_file_name=None,
               minimum_block_size=2**13):
    """Provide a CGI interface to make_compressed_update_file.

    mbox_fname: the path to the master file to serve up parts of
    title: how to describe it in the <title>
    password_file_name: the path to a file of one-time passwords for
        validating requests (default is don't validate)
    minimum_block_size: the smallest block size to accept in an
        uploaded hashfile (to keep memory consumption under control).

    The idea is that you upload a hash file (probably produced by
    running syncfiles.py --hash on the command line) and get a
    downloadable updates file in response.  You can upload a hash file
    of any blocksize at all, down to minimum_block_size.  Provides a
    very minimal HTML interface to facilitate the uploading.
""" cgitb.enable() q = cgi.FieldStorage(SizeLimitingStdinProxy()) if os.environ.get("REQUEST_METHOD") != 'POST': sys.stdout.write('''Content-Type: text/html\n <html><head><title>Synchronize File %(title)s</title></head><body> <h1>Synchronize File %(title)s</h1> <form method="POST" enctype="multipart/form-data"> <input type="file" name="hashfile" /> <input type="submit" /> </form></body></html> ''' % {'title' : title}) return passwords = NullPasswordRepository() if password_file_name: passwords = FilePasswordRepository(password_file_name) mbox = file(mbox_fname, 'r') ourhashes = StringIO.StringIO() theirhashes = q['hashfile'].file outfile = sys.stdout # Ensure their hash file is valid; also get blocksize: blocksize, _, _, hash_iter = read_hashfile(theirhashes) if blocksize < minimum_block_size: raise BlockSizeUnacceptablySmall(minimum_block_size, blocksize) list(hash_iter) # read and decode the whole thing theirhashes.seek(0) # Print the header after as much as possible of the exception-prone # stuff has happened, but before the time-consuming stuff. sys.stdout.write('Content-Type: application/octet-stream\n' 'Content-Disposition: attachment; filename="syncfiles-updates.gz"\n\n') sys.stdout.flush() make_hashfile(mbox, ourhashes, NullPasswordRepository(), blocksize) ourhashes.seek(0) make_compressed_update_file(mbox=mbox, ourhashes=ourhashes, theirhashes=theirhashes, passwords=passwords, outfile=outfile) ### Command-line interface def usage(argv0): """Provide a command-line usage message.""" sys.stderr.write("""Usage: one of: %(prog)s [--blocksize size] [--passwords passwordfile] \\ --hash mailboxfile > hashfile %(prog)s [--passwords passwordfile] \\ --send ourhashes theirhashes < mailboxfile > updatefile.gz %(prog)s --receive mailboxfile updatefile.gz\n""" % { 'prog': argv0 }) return 1 def main(argv): """Command-line interface. 
    See contents of 'usage' for documentation."""
    blocksize = default_block_size
    passwords = NullPasswordRepository()
    while 1:
        if len(argv) < 2:
            return usage(argv[0])
        if argv[1] == '--blocksize':
            if len(argv) < 3:
                return usage(argv[0])
            blocksize = int(argv[2])
            argv = [argv[0]] + argv[3:]
            # continue
        elif argv[1] == '--passwords':
            if len(argv) < 3:
                return usage(argv[0])
            passwords = FilePasswordRepository(argv[2])
            argv = [argv[0]] + argv[3:]
            # continue
        elif argv[1] == '--hash':
            infile = sys.stdin
            if len(argv) > 2:
                infile = file(argv[2])
            make_hashfile(infile, sys.stdout, passwords,
                          blocksize=blocksize)
            return 0
        elif argv[1] == '--send':
            if len(argv) < 4:
                return usage(argv[0])
            make_compressed_update_file(sys.stdin, file(argv[2]),
                                        file(argv[3]),
                                        passwords=passwords,
                                        outfile=sys.stdout)
            return 0
        elif argv[1] == '--receive':
            if len(argv) < 4:
                return usage(argv[0])
            apply_compressed_update_file(file(argv[2], 'r+'),
                                         file(argv[3]))
            return 0
        else:
            return usage(argv[0])

if __name__ == '__main__':
    sys.exit(main(sys.argv))

Here's an example of how to call it as a CGI script:

#!/usr/bin/python
# This is an example of how to configure syncfiles --send as a CGI
# script server.
import sys
sys.path.insert(0, '/home/kragen/devel/syncfiles')
import syncfiles
syncfiles.run_as_cgi(
    mbox_fname='/home/kragen/mail/currentmbox.panacea',
    title="Kragen's Mailbox Test",
    password_file_name='/tmp/passdir/some_passwords')
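The core disjoint-block scheme the script implements is easy to demo in miniature. This sketch uses Python 3 and hashlib in place of the long-deprecated sha module, works on in-memory bytes rather than the file format above, and uses a tiny made-up block size:

```python
import hashlib

BLOCKSIZE = 4   # tiny, for the demo; the script's default is 2**18

def block_hashes(data, blocksize=BLOCKSIZE):
    """First 16 bytes of the SHA-1 of each disjoint block."""
    return [hashlib.sha1(data[i:i + blocksize]).digest()[:16]
            for i in range(0, len(data), blocksize)]

def make_updates(master, theirhashes, blocksize=BLOCKSIZE):
    """Collect the (blocknum, data) pairs the receiver is missing."""
    updates = []
    for blocknum, ourhash in enumerate(block_hashes(master, blocksize)):
        theirs = (theirhashes[blocknum]
                  if blocknum < len(theirhashes) else None)
        if ourhash != theirs:
            start = blocknum * blocksize
            updates.append((blocknum, master[start:start + blocksize]))
    return updates, len(master)   # new length, for truncation

def apply_updates(copy, updates, newlen, blocksize=BLOCKSIZE):
    """Patch our copy with the received blocks, then truncate."""
    buf = bytearray(max(len(copy), newlen))
    buf[:len(copy)] = copy
    for blocknum, data in updates:
        start = blocknum * blocksize
        buf[start:start + len(data)] = data
    return bytes(buf[:newlen])

master = b'From alice...From bob...From carol...'
copy = b'From alice...From bob...'   # receiver's stale copy
updates, newlen = make_updates(master, block_hashes(copy))
assert apply_updates(copy, updates, newlen) == master
```

Since only the four tail blocks differ, only those cross the wire; everything the receiver already holds is confirmed by hash comparison alone, which is exactly the append-mostly behavior a growing mailbox exhibits.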