Here's what I've been using for downloading my email. It lets me download 3 megabytes of email, or about 500 messages, in about two minutes, compressed down to a little over 1 megabyte; and it lets me do that without giving my password to an insecure machine in an internet access center. No encryption yet, though.
#!/usr/bin/python """Synchronize files in store-and-forward fashion. My laptop currently may not have a working network connection, and in Ecuador we often couldn't use any direct internet connections anyway --- 'ciberes' or 'cafenetes' were everywhere, but few of them would let us hook up our laptops to their Ethernet, and fewer of them had wifi. So often I would send email by putting a batch on an SD card, going to the cafenet, and uploading it; until now, I didn't have a similar approach for downloading. Rsync ----- My usual mail-management scheme is to rsync my inbox, or at least the tail end of it, onto my laptop. But rsync doesn't run over store-and-forward. rsync has this very ingenious scheme which avoids copying over any big chunks of data you already have. My understanding is that it works as follows: - the receiver takes the CRC-32 of each disjoint block of a certain fixed size in their copy, and sends them to the sender (so that, for example, if the block is 4kiB and the file is 1MiB, they send 256 CRCs totaling 1kiB); - the sender takes the CRC-32 of each block of that size in their file, but not just disjoint blocks --- so that, for example, if their copy of the file is 1MiB and the block size is 4kiB, they take 1048576 - 4096 + 1 CRC-32s, and remember which ones represent blocks the receiver already has (and which blocks); - the sender sends instructions on how to create the correct file by copying blocks from the receiver's version of the file and inserting new data where the receiver doesn't already have it. Doing the second step in a brute-force way would be fairly slow --- in this case, it would involve more than 4 billion CRC operations. However, given the CRC-32 of 'x' + stuff (where '+' is string concatenation), it's straightforward and efficient to compute the CRC-32 not only of 'x' + stuff + 'y', but also of stuff + 'y'. (Rabin-Karp string search works similarly.) 
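That sliding-window property can be illustrated with a Rabin-Karp-style polynomial hash instead of a CRC; the constants and function names here are illustrative, not anything rsync actually uses:

```python
# Illustration of the rolling-hash idea with a Rabin-Karp-style
# polynomial hash; rsync's rolling CRC exploits the same property.
# BASE and MOD are arbitrary illustrative constants.
BASE, MOD = 257, (1 << 61) - 1

def hash_block(block):
    # O(len(block)): ordinary polynomial hash of a whole window.
    h = 0
    for byte in block:
        h = (h * BASE + byte) % MOD
    return h

def roll(h, outgoing, incoming, power):
    # O(1): given the hash of 'x' + stuff, get the hash of
    # stuff + 'y' by dropping 'x' (outgoing) and appending 'y'
    # (incoming).  'power' is BASE**(windowsize - 1), precomputed.
    return ((h - outgoing * power) * BASE + incoming) % MOD

data, k = b'some mailbox bytes', 4
power = pow(BASE, k - 1, MOD)
h = hash_block(data[:k])
for i in range(1, len(data) - k + 1):
    h = roll(h, data[i - 1], data[i - 1 + k], power)
    assert h == hash_block(data[i:i + k])   # each slide costs O(1)
```

So the sender hashes every window of the file at a cost of O(1) per window instead of O(blocksize).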
Unfortunately, I don't have a CRC-computing library handy, and I don't
understand CRC well enough to implement it myself.

Additionally, I have another issue.  I'm rsyncing my email, so other
people have control over the data I'm rsyncing; with a naive
implementation of the above algorithm, they could send me a chunk of
data with the same CRC as some later email they expect I will get, in
order to prevent me from getting the later email.  This is a problem
because it's easy to construct data with a particular CRC.  (In a less
naive implementation, the receiver would send a secure hash of each
block as well, which the sender would check before trusting the CRC.)

There's an additional problem that both the sender and the receiver
have to read the entire file before any of it gets transferred, which
is a problem when I'm being charged by the minute in a cafenet.

This program
------------

So I'm implementing something similar, but simpler, with the following
differences:

- it's store-and-forward;

- both the sender and the receiver work with only disjoint blocks of
  the file;

- they use (the first 128 bits of) the SHA-1 sum rather than CRC;

- I'm thinking about implementing an 'append-only' mode in which the
  sender assumes that every block except the last is correct, without
  checking.  But as long as it only takes 45 seconds to hash my whole
  mailbox, I'm not gonna bother.

- The block size is chosen so that the 'waste' data transmitted by the
  receiver is about the same size as the 'waste' data from the final
  block redundantly transmitted by the sender, for files of about the
  size of my mailbox, 1GiB.  They are inversely proportional.  The
  expected waste for the sender is half a block, compressed by a fixed
  factor of about 3, i.e. blocksize/6; the waste sent by the receiver
  is 16*filesize/blocksize.  16*filesize/blocksize = blocksize/6 when
  6*16*filesize = blocksize^2, which is close to 256KiB (2^18 bytes)
  for a filesize of 2^30 bytes.
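This arithmetic can be sanity-checked directly; the constants below just restate the assumptions above (16-byte hashes, sender waste of half a block compressed about 3x, a mailbox of about 1 GiB):

```python
import math

# Assumptions restated from the text: 16-byte hashes, sender waste of
# half a block compressed ~3x (so blocksize/6), a 2**30-byte mailbox.
hashsize, filesize = 16, 2**30

# receiver waste: hashsize * filesize / blocksize
# sender waste:   blocksize / 6
# These are equal when blocksize**2 == 6 * hashsize * filesize.
blocksize = math.sqrt(6 * hashsize * filesize)
# blocksize comes out around 320 KiB; the nearest power of two
# is 2**18 = 256 KiB, the default the code below actually uses.
assert 2**18 < blocksize < 2**19
assert round(math.log2(blocksize)) == 18
```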
In its current form, it takes about 45 seconds on my laptop to read my
862MiB mailbox, yielding a 55194-byte 'hashes' file.

File Format
-----------

This program uses two kinds of files: hash files and updates files.

A hash file begins with hashfile_magic_number (see the code), a
newline-terminated line containing the blocksize in ASCII, and a
newline-terminated line containing the original file size in ASCII.
The next line is a one-time password, which may be used by a server to
decide whether or not to answer a request.  The rest of the file is
binary hash data, 16 bytes per hash, all mushed together.

An updates file is gzipped, but when gunzipped is plain ASCII (insofar
as the original file whose data is contained therein is plain ASCII,
anyway).  The gunzipped data begins with updatefile_magic_number (see
the code) and a newline-terminated line containing the blocksize in
ASCII.  Then follow two or more 'chunks'; each chunk begins with a
magic number, one of block_magic, done_magic, or trunc_magic (see the
code).

Chunks that begin with block_magic follow it with the block number (as
a newline-terminated ASCII number), the size of that block (usually
the file blocksize, but the last block in the file might be shorter)
(as a newline-terminated ASCII number), and the data of the block.
You'll need to count bytes to find the end of the data of the block.

A chunk that begins with done_magic has nothing else in it, and
terminates the file.

A chunk that begins with trunc_magic follows it with the size of the
up-to-date file (which the receiver's copy is truncated to), as a
newline-terminated ASCII number.

TODO
----

D Add one-time passwords (optionally include in hashes file?)
D add --passwords option to --hash
D change hash file format to have a password line
D add --passwords option to --send (and cgi interface)
D Restrict upload of arbitrarily large files (maybe with stdin proxy?)
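As a concrete illustration of that layout (block number, data, and sizes here are made up; the magic strings are the real ones from the code), the gunzipped bytes of an updates file that replaces block 3 with four bytes and then truncates to 1000000 bytes would be:

```python
# Hypothetical gunzipped updates-file contents, following the format
# described above.  The block data 'abcd' and the sizes are invented
# for the example; the magic strings match the code.
updatefile_magic_number = b'syncfiles updatefile\n'
block_magic = b'\nblock\n'
trunc_magic = b'\ntrunc\n'
done_magic = b'\ndone!\n'

update = (updatefile_magic_number
          + b'262144\n'     # blocksize, newline-terminated ASCII
          + block_magic
          + b'3\n'          # block number
          + b'4\n'          # size of this block's data
          + b'abcd'         # block data; count bytes to find its end
          + trunc_magic
          + b'1000000\n'    # size to truncate the receiver's copy to
          + done_magic)     # terminates the file
```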
""" import sha, sys, gzip, cgi, cgitb, os, StringIO, itertools ### Magic numbers # Generally changing these will lead to incompatibilities; they are # used (some implicitly) in the file formats. default_block_size = 2**18 hashsize = 16 hashfile_magic_number = 'syncfiles hashfile\n' updatefile_magic_number = 'syncfiles updatefile\n' block_magic = '\nblock\n' done_magic = '\ndone!\n' trunc_magic = '\ntrunc\n' ### Exceptions class ExtraDataInUpdateFile(Exception): """The updates file should end after its 'done!' token, but one didn't.""" class MismatchedBlockSize(Exception): """Receiver is using the wrong block size.""" class MismatchedMagicNumber(Exception): """A file has the wrong magic number and is therefore the wrong kind of file.""" class TruncatedHash(Exception): """A hash file ended in the middle of a hash value and is therefore probably incomplete.""" class UnexpectedEndOfUpdatesFile(Exception): """The updates file didn't have a 'done!' token and is therefore probably incomplete.""" class BlockSizeUnacceptablySmall(Exception): """The block size of the uploaded file is too small; it would lead to unacceptable memory consumption.""" class TooMuchDataUploaded(Exception): """A SizeLimitingStdinProxy got too much data. Probably somebody tried to upload too much data to a CGI script.""" class InvalidPassword(Exception): """Somebody tried to authenticate a hash file with a password that didn't pass.""" ### Password file handling class NullPasswordRepository: """Password repository that validates any password and provides a dummy password when needed.""" def password_is_valid(self, password): return True def get_password(self): return 'no password repository given' class AtomicallyUpdatableFile: """Atomic updates for a text file. Only readable by __iter__. The process calling this must have permission to read and delete the file, although not necessarily to write it, and to create files in the same directory. Depends on Unix atomic rename(). 
""" def __init__(self, filename): self.filename = filename def __iter__(self): """Iterate over lines of current contents of file.""" return iter(file(self.filename)) def update(self, iterable): """Put results of iterable into new version of file, atomically. You can still have lost-updates problems with this if you have multiple concurrent updaters, which (XXX!) might potentially be a security problem allowing password-guessing. """ nfn = self.filename + '.new' out = file(nfn, 'w') try: for item in iterable: out.write(item) out.close() os.rename(nfn, self.filename) finally: if os.path.exists(nfn): os.remove(nfn) class FilePasswordRepository: """Password repository that uses a text file, one password per line.""" def __init__(self, filename): self.file = AtomicallyUpdatableFile(filename) def __repr__(self): return 'FilePasswordRepository(%r)' % self.file.filename def passwords(self): """Iterate over passwords from the file.""" for line in self.file: yield line.rstrip('\n') def password_is_valid(self, password): """If 'password' is in the file, remove it and return True. Otherwise, return false. """ for possible_password in self.passwords(): if password == possible_password: self.eliminate_password(password) return True return False def get_password(self): """Choose a password from the file (removing it).""" rv = self.passwords().next() self.eliminate_password(rv) return rv def eliminate_password(self, password): """Internal: remove a password from the file.""" self.file.update(line for line in self.file if line.rstrip('\n') != password) ### Hashfile and update file handling class NumberPacker: """Serializes and deserializes numbers. The interface is pretty much the same as that of the 'Struct' object I wrote for mail indexing, but the format is newline-delimited ASCII. 
""" def pack(self, num): """Return an ASCII representation of a number.""" return '%d\n' % num def read(self, fo): """Read a number from a file object, returning it in a 1-tuple.""" line = fo.readline() return self.unpack(line) def unpack(self, data): """Deserialize a number from its ASCII representation. Returns it in a 1-tuple (which is kind of silly, but is for compatibility with Struct.) """ return (int(data),) def write(self, fo, num): """Write a number to a file object.""" fo.write(self.pack(num)) number = NumberPacker() def hashes(fileobj, blocksize=default_block_size): """Iterate over the SHA-1 hashes of the blocks of a file.""" while 1: data = fileobj.read(blocksize) if not data: break yield sha.sha(data).digest()[:hashsize] def make_hashfile(fileobj, fileout, passwords, blocksize=default_block_size): """Write out a hashfile in the format expected by this program.""" fileout.write(hashfile_magic_number + number.pack(blocksize)) fileobj.seek(0, 2) # to EOF fileout.write(number.pack(fileobj.tell())) # length of file password = passwords.get_password() assert '\n' not in password fileout.write(password + '\n') fileobj.seek(0) for hash in hashes(fileobj, blocksize): fileout.write(hash) def _read_hashfile(infile): """Iterate over the hashes read from the body of a hashfile.""" while 1: next_hash = infile.read(hashsize) if not next_hash: return if len(next_hash) != hashsize: raise TruncatedHash(repr(next_hash)) yield next_hash def read_hashfile(infile): """Decode a hashfile, given an open file object on it. Returns a tuple of blocksize, original file size, password, and an iterator over the hashes. 
""" magic_number = infile.read(len(hashfile_magic_number)) if magic_number != hashfile_magic_number: raise MismatchedMagicNumber(hashfile_magic_number, magic_number) blocksize = number.read(infile)[0] input_file_size = number.read(infile)[0] password = infile.readline().rstrip('\n') return blocksize, input_file_size, password, _read_hashfile(infile) def make_update_file(mbox, ourhashes, theirhashes, passwords, outfile): """Make an uncompressed update file for bringing somebody up to date. mbox: the 'master' version of the file to bring them up to date with ourhashes: an up-to-date hash file of the 'master' version theirhashes: a hash file describing what they already have. Must have same block size as 'ourhashes'. passwords: a password repository to authenticate the hashfile with (use NullPasswordRepository to not authenticate) outfile: a file object to write the updates file to. """ ourblocksize, our_file_size, _, ourhashes = read_hashfile(ourhashes) (theirblocksize, their_file_size, password, theirhashes) = read_hashfile(theirhashes) if not passwords.password_is_valid(password): raise InvalidPassword(password, passwords) if ourblocksize != theirblocksize: raise MismatchedBlockSize(ourblocksize, theirblocksize) outfile.write(updatefile_magic_number + number.pack(ourblocksize)) for ourhash, blocknum in itertools.izip(ourhashes, itertools.count()): try: theirhash = theirhashes.next() except StopIteration: theirhash = '' if ourhash != theirhash: outfile.write(block_magic) number.write(outfile, blocknum) mbox.seek(blocknum * ourblocksize) data = mbox.read(ourblocksize) number.write(outfile, len(data)) outfile.write(data) outfile.write(trunc_magic) number.write(outfile, our_file_size) outfile.write(done_magic) outfile.close() def make_compressed_update_file(mbox, ourhashes, theirhashes, passwords, outfile): """Make a compressed update file with GzipFile. See make_update_file for meanings of arguments. 
""" make_update_file(mbox, ourhashes, theirhashes, passwords, gzip.GzipFile('', 'w', 9, outfile)) outfile.close() def apply_update_file(mbox, updatefile): """Given an uncompressed update file, update our version of the file. So we have a file produced by make_update_file and the file from which the 'theirhashes' file was produced; this applies the updates specified in 'updatefile' in order to make our 'mbox' exactly like their 'mbox'. """ magic = updatefile.read(len(updatefile_magic_number)) if magic != updatefile_magic_number: raise MismatchedMagicNumber(updatefile_magic_number, magic) blocksize = number.read(updatefile)[0] assert len(block_magic) == len(done_magic) == len(trunc_magic) while 1: magic = updatefile.read(len(block_magic)) if magic == block_magic: blocknum = number.read(updatefile)[0] mbox.seek(blocknum * blocksize) this_block_size = number.read(updatefile)[0] mbox.write(updatefile.read(this_block_size)) elif magic == done_magic: extra = updatefile.read(80) if extra: raise ExtraDataInUpdateFile(repr(extra)) break elif magic == trunc_magic: file_length = number.read(updatefile)[0] mbox.truncate(file_length) elif magic == '': raise UnexpectedEndOfUpdatesFile() def apply_compressed_update_file(mbox, updatefile): """Same as apply_update_file, but updatefile is assumed to be gzipped.""" apply_update_file(mbox, gzip.GzipFile('', 'r', 9, updatefile)) ### CGI interface class SizeLimitingStdinProxy: """Support cgi.FieldStorage's use of stdin while limiting upload size. Only supports the very minimal part of the file protocol cgi.FieldStorage currently uses on self.fp: .read(size) (no default) and .readline(). When using cgitb, this seems to result in deadlock when run in CGI under (at least) Debian Apache 1.3.33-4: Apache is trying to write uploaded data to the CGI program, but the CGI program is trying to write an error message to Apache, and both pipes are full. After a few minutes, Apache's child has a SIGALRM and kills the CGI program. 
    But deadlock is an OK result --- it prevents attackers from
    filling up the disk!  It does make it easy for them to deny
    service to the web server, at least for a few minutes.

    """
    def __init__(self, fp=sys.stdin, maxbytes=2*1024*1024):
        """fp: the file to limit the size on.
        maxbytes: the maximum number of bytes to allow to be uploaded.
        """
        self.fp = fp
        self.maxbytes = self.remaining_bytes = maxbytes

    def read(self, size):
        """Proxy .read of underlying filehandle while limiting bytes."""
        rv = self.fp.read(size)
        self.use_up(len(rv))
        return rv

    def readline(self):
        """Proxy .readline of underlying filehandle while limiting bytes."""
        rv = self.fp.readline()
        self.use_up(len(rv))
        return rv

    def use_up(self, amount_of_data):
        """Account for amount_of_data bytes being uploaded, possibly raising.

        If the total number of bytes used up so far is too great,
        raises TooMuchDataUploaded.
        """
        self.remaining_bytes -= amount_of_data
        if self.remaining_bytes < 0:
            raise TooMuchDataUploaded(self.maxbytes)

def run_as_cgi(mbox_fname, title, password_file_name=None,
               minimum_block_size=2**13):
    """Provide a CGI interface to make_compressed_update_file.

    mbox_fname: the path to the master file to serve up parts of
    title: how to describe it in the <title>
    password_file_name: the path to a file of one-time passwords for
        validating requests (default is don't validate)
    minimum_block_size: the smallest block size to accept in an
        uploaded hashfile (to keep memory consumption under control).

    The idea is that you upload a hash file (probably produced by
    running syncfiles.py --hash on the command line) and get a
    downloadable updates file in response.  You can upload a hash file
    of any blocksize at all, down to minimum_block_size.  Provides a
    very minimal HTML interface to facilitate the uploading.
""" cgitb.enable() q = cgi.FieldStorage(SizeLimitingStdinProxy()) if os.environ.get("REQUEST_METHOD") != 'POST': sys.stdout.write('''Content-Type: text/html\n <html><head><title>Synchronize File %(title)s</title></head><body> <h1>Synchronize File %(title)s</h1> <form method="POST" enctype="multipart/form-data"> <input type="file" name="hashfile" /> <input type="submit" /> </form></body></html> ''' % {'title' : title}) return passwords = NullPasswordRepository() if password_file_name: passwords = FilePasswordRepository(password_file_name) mbox = file(mbox_fname, 'r') ourhashes = StringIO.StringIO() theirhashes = q['hashfile'].file outfile = sys.stdout # Ensure their hash file is valid; also get blocksize: blocksize, _, _, hash_iter = read_hashfile(theirhashes) if blocksize < minimum_block_size: raise BlockSizeUnacceptablySmall(minimum_block_size, blocksize) list(hash_iter) # read and decode the whole thing theirhashes.seek(0) # Print the header after as much as possible of the exception-prone # stuff has happened, but before the time-consuming stuff. sys.stdout.write('Content-Type: application/octet-stream\n' 'Content-Disposition: attachment; filename="syncfiles-updates.gz"\n\n') sys.stdout.flush() make_hashfile(mbox, ourhashes, NullPasswordRepository(), blocksize) ourhashes.seek(0) make_compressed_update_file(mbox=mbox, ourhashes=ourhashes, theirhashes=theirhashes, passwords=passwords, outfile=outfile) ### Command-line interface def usage(argv0): """Provide a command-line usage message.""" sys.stderr.write("""Usage: one of: %(prog)s [--blocksize size] [--passwords passwordfile] \\ --hash mailboxfile > hashfile %(prog)s [--passwords passwordfile] \\ --send ourhashes theirhashes < mailboxfile > updatefile.gz %(prog)s --receive mailboxfile updatefile.gz\n""" % { 'prog': argv0 }) return 1 def main(argv): """Command-line interface. 
    See contents of 'usage' for documentation."""
    blocksize = default_block_size
    passwords = NullPasswordRepository()
    while 1:
        if len(argv) < 2:
            return usage(argv[0])
        if argv[1] == '--blocksize':
            if len(argv) < 3:
                return usage(argv[0])
            blocksize = int(argv[2])
            argv = [argv[0]] + argv[3:]
            # continue
        elif argv[1] == '--passwords':
            if len(argv) < 3:
                return usage(argv[0])
            passwords = FilePasswordRepository(argv[2])
            argv = [argv[0]] + argv[3:]
            # continue
        elif argv[1] == '--hash':
            infile = sys.stdin
            if len(argv) > 2:
                infile = file(argv[2])
            make_hashfile(infile, sys.stdout, passwords,
                          blocksize=blocksize)
            return 0
        elif argv[1] == '--send':
            if len(argv) < 4:
                return usage(argv[0])
            make_compressed_update_file(sys.stdin, file(argv[2]),
                                        file(argv[3]),
                                        passwords=passwords,
                                        outfile=sys.stdout)
            return 0
        elif argv[1] == '--receive':
            if len(argv) < 4:
                return usage(argv[0])
            apply_compressed_update_file(file(argv[2], 'r+'),
                                         file(argv[3]))
            return 0
        else:
            return usage(argv[0])

if __name__ == '__main__':
    sys.exit(main(sys.argv))

Here's an example of how to call it as a CGI script:

#!/usr/bin/python
# This is an example of how to configure syncfiles --send as a CGI
# script server.
import sys
sys.path.insert(0, '/home/kragen/devel/syncfiles')
import syncfiles
syncfiles.run_as_cgi(
    mbox_fname='/home/kragen/mail/currentmbox.panacea',
    title="Kragen's Mailbox Test",
    password_file_name='/tmp/passdir/some_passwords')
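The core disjoint-block scheme the script implements is easy to demo in miniature. This sketch uses Python 3 and hashlib in place of the long-deprecated sha module, works on in-memory bytes rather than the file format above, and uses a tiny made-up block size:

```python
import hashlib

BLOCKSIZE = 4   # tiny, for the demo; the script's default is 2**18

def block_hashes(data, blocksize=BLOCKSIZE):
    """First 16 bytes of the SHA-1 of each disjoint block."""
    return [hashlib.sha1(data[i:i + blocksize]).digest()[:16]
            for i in range(0, len(data), blocksize)]

def make_updates(master, theirhashes, blocksize=BLOCKSIZE):
    """Collect the (blocknum, data) pairs the receiver is missing."""
    updates = []
    for blocknum, ourhash in enumerate(block_hashes(master, blocksize)):
        theirs = (theirhashes[blocknum]
                  if blocknum < len(theirhashes) else None)
        if ourhash != theirs:
            start = blocknum * blocksize
            updates.append((blocknum, master[start:start + blocksize]))
    return updates, len(master)   # new length, for truncation

def apply_updates(copy, updates, newlen, blocksize=BLOCKSIZE):
    """Patch our copy with the received blocks, then truncate."""
    buf = bytearray(max(len(copy), newlen))
    buf[:len(copy)] = copy
    for blocknum, data in updates:
        start = blocknum * blocksize
        buf[start:start + len(data)] = data
    return bytes(buf[:newlen])

master = b'From alice...From bob...From carol...'
copy = b'From alice...From bob...'   # receiver's stale copy
updates, newlen = make_updates(master, block_hashes(copy))
assert apply_updates(copy, updates, newlen) == master
```

Since only the four tail blocks differ, only those cross the wire; everything the receiver already holds is confirmed by hash comparison alone, which is exactly the append-mostly behavior a growing mailbox exhibits.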