#!/usr/bin/python
# -*- coding: utf-8 -*-
"""Measuring the effectiveness of huffman coding.

Francis Bacon’s invention of Baudot code
----------------------------------------

Francis Bacon invented a code called the “biliteral alphabet”, which
he described as a means of making anything mean anything, in which the
coded message was represented by sequences of two possible letters,
which you could represent as “a” and “b”; you could encode, for
example, “A” as “aaaaa”, “B” as “aaaab”, “C” as “aaaba”, “D” as
“aaabb”, and so on.  I’m not totally clear on how much of this was
explained in Bacon’s work and how much was discovered in the 1890s.
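
The mapping Bacon describes is just counting in binary with “a” as 0
and “b” as 1.  A minimal sketch (using the modern 26-letter alphabet
for simplicity; Bacon’s own alphabet conflated I/J and U/V):

```python
# Bacon's biliteral alphabet: each letter becomes five "a"s and "b"s,
# i.e. its alphabetic index written in binary with a = 0 and b = 1.
def biliteral(ch):
    n = ord(ch.lower()) - ord('a')
    return ''.join('ab'[(n >> i) & 1] for i in (4, 3, 2, 1, 0))

def encode(message):
    return ' '.join(biliteral(ch) for ch in message if ch.isalpha())

print(encode('Bacon'))  # aaaab aaaaa aaaba abbba abbab
```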

There is an excellent article on this in the Winter 2010/11 issue of
Cabinet magazine, [How to Make Anything Signify Anything][0], by
William H. Sherman.

[0]: http://www.cabinetmagazine.org/issues/40/sherman.php

An ingenious aspect of this code is that anything that can appear in
sequences, each of which can be in one of two recognizable states, can
be used to bear the message.  Apparently Bacon used two fonts of type,
such that any letter drawn from the first font represented an “a”, and
any letter drawn from the second font represented a “b”.  This allowed
him to encode a secret message in any printed text with only a
fivefold expansion in the message size.  Of course, the two fonts were
in a single harmonious style, so it might not be obvious from looking
which letters belonged to which font; and in cases of handwriting,
still more variation is possible.
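
The two-font trick can be modeled in a few lines; here lowercase
stands in for the first font and uppercase for the second (a toy
rendering, of course; in print the two fonts would differ far more
subtly):

```python
# Toy model of Bacon's two-font steganography: hide a biliteral
# message in a cover text by choosing a "font" for each letter.
# Lowercase stands in for the first font ("a"), uppercase for the
# second ("b"); real fonts would differ far more subtly.
def hide(biliteral_message, cover):
    bits = iter(biliteral_message.replace(' ', ''))
    out = []
    for ch in cover:
        if ch.isalpha():
            try:
                out.append(ch.lower() if next(bits) == 'a' else ch.upper())
            except StopIteration:
                out.append(ch.lower())  # message exhausted; default font
        else:
            out.append(ch)
    return ''.join(out)

print(hide('aaaab aaaaa', 'all is not gold'))  # all iS not gold
```

Recovering the message is the reverse: read each letter’s case back
into “a” or “b” and split into groups of five.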

Variable-length codes for better efficiency and the Huffman algorithm
---------------------------------------------------------------------

From a modern perspective, of course, what Bacon invented was the
Baudot code, which represents letters and potentially other characters
as binary sequences; and a natural thing to wonder is whether encoding
efficiency could have been improved with a variable-length code, like
Morse Code, in which the more common letters are encoded with shorter
code sequences.  Such a code is still eminently practical to encode
and decode by hand.  The Huffman algorithm computes an “optimal”
version of such a code.  For the letters of the Project Gutenberg
version of the King James Bible, a Huffman code as computed by this
program is:

    ('000000', 'v')
    ('0000010000', 'q')
    ('0000010001', 'x')
    ('000001001', 'z')
    ('00000101', 'j')
    ('0000011', 'k')
    ('00001', 'm')
    ('0001', 'd')
    ('001', 't')
    ('01000', 'u')
    ('01001', 'f')
    ('0101', 'r')
    ('0110', 's')
    ('0111', 'i')
    ('100', 'e')
    ('101000', 'p')
    ('101001', 'b')
    ('101010', 'c')
    ('101011', 'g')
    ('1011', 'n')
    ('1100', 'o')
    ('110100', 'y')
    ('110101', 'w')
    ('11011', 'l')
    ('1110', 'a')
    ('1111', 'h')

So you see the most common letters (‘e’ and ‘t’) are represented by
only 3 bits, other common letters (‘a’, ‘d’, ‘h’, ‘i’, ‘r’, ‘s’, ‘n’,
‘o’) are represented by 4 bits, and the least common letters ‘q’ and
‘x’ have 10-bit codes.  And no valid code begins with any other valid
code --- for example, ‘e’ has been assigned ‘100’, so no other letter
can have a code that begins with ‘100’.  This avoids any need for
marking where the code for one letter ends and the next begins.

Encoding the 3 239 443 letters of this book with this code, we need,
on average, only 4.15 bits per letter rather than 5.

Context-dependent variable-length codes
---------------------------------------

A somewhat more difficult, but still practical, approach is to use a
number of different variable-length codes depending on context.  For
example, in English, after a ‘j’, it is quite unlikely to find any
consonant (in Bacon’s day, ‘j’ was not a distinct letter, so this did
not apply) and after ‘q’ it is quite unlikely to find anything but
‘u’.  If we use 26 different Huffman codes instead of one, then
encoding and decoding will be trickier, but we can achieve better
efficiency.  Using such a set of codes for this same version of the
Bible, we reduce the space needed from 4.15 bits per letter to 3.5.
However, the code used is not complete; because every ‘q’ in the text
is followed by a ‘u’, for example, the ‘u’ in that position is
assigned the zero-length code ‘’, and it is therefore impossible to
represent any other letter following a ‘q’; and similarly, it is only
possible to represent vowels following a ‘j’.  (‘e’ in that position
is assigned the code ‘0’, ‘o’ the code ‘10’.)  Making the code
complete might worsen its efficiency slightly.  For example, 'u'
following 'q' would require at least one bit.
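
The idea can be sketched with toy tables (these are invented for
illustration, not computed from the Bible; only the zero-cost ‘u’
after ‘q’ is taken from the discussion above):

```python
# Toy context-dependent code: a separate prefix-free table for each
# preceding letter.  These tables are invented for illustration; note
# that after 'q' the letter 'u' costs zero bits.
TABLES = {
    'q': {'': 'u'},                         # 'u' is certain after 'q'
    'j': {'0': 'e', '10': 'o', '11': 'u'},  # only vowels follow 'j'
}
DEFAULT = {'0': 'e', '10': 'q', '11': 'j'}  # all other contexts

def encode(s):
    out, prev = [], None
    for ch in s:
        inverse = {v: k for k, v in TABLES.get(prev, DEFAULT).items()}
        out.append(inverse[ch])
        prev = ch
    return ''.join(out)

def decode(bits):
    out, buf, prev, i = [], '', None, 0
    while True:
        table = TABLES.get(prev, DEFAULT)
        if buf in table:            # includes the zero-length code ''
            prev = table[buf]
            out.append(prev)
            buf = ''
        elif i < len(bits):
            buf += bits[i]
            i += 1
        else:
            break
    return ''.join(out)
```

The decoding loop is the same as for a single table; the only changes
are that the table is chosen by the previously decoded letter, and
that a zero-length codeword matches before any bits are read.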

Optimizing Huffman codes for manual use
---------------------------------------

Bacon’s original invention was intended for human beings to read and
write, not automatic computers.  In modern use of Huffman codes, it is
common to transmit the Huffman table along with the encoded text, but
I believe that for human beings, it would probably be more practical
to standardize on a common code.

Today it is more common to use character codes that represent more
than just the letters of a text; ASCII includes case, spaces,
punctuation, and numbers, for example, while Unicode also includes
accents, not to mention other alphabets.  If we encode the entire
text, including all of these distinctions, with a single Huffman code,
it requires 4.58 bits per character, a total of 19 939 830 bits; if we
use a separate Huffman table for each single-character context, it's
3.42 bits per character, 14 874 433 bits in all.  This compares to
13 450 484 bits to represent all the letters with a single Huffman
table and 11 339 286 bits to represent them with 26 different tables.

I wrote another program (previously posted to kragen-hacks under the
rubric “reducing charset size for compressibility with case-shift
characters (in Python)”
<http://lists.canonical.org/pipermail/kragen-hacks/2011-April/000520.html>)
to reduce the number of different characters that need to be encoded,
by using a single code for each letter, and changing between
upper-case and lower-case letters with special “control characters”.
It uses \x11 to switch to lowercase, \x12 to switch to uppercase, \x13
to switch to digits (0 is a, 1 is b, and so on), and \x14 to switch
only the next character to uppercase.  This way, the poor monk who’s
manually enciphering and deciphering text only needs about 40
characters in his codebook for ordinary text, and only 92 for all of
ASCII.
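
The case-shift transformation itself is straightforward; a sketch
handling only \x11 and \x12 (the posted program linked above also
handles digits with \x13 and single-letter shifts with \x14):

```python
# Case-shift preprocessing sketch: fold letters to lowercase,
# inserting \x12 where the text switches to uppercase and \x11 where
# it switches back.  Digits (\x13) and single-character uppercase
# (\x14) are omitted here; the posted kragen-hacks program has them.
def fold_case(text):
    out, upper = [], False
    for ch in text:
        if ch.isalpha():
            if ch.isupper() != upper:
                upper = ch.isupper()
                out.append('\x12' if upper else '\x11')
            out.append(ch.lower())
        else:
            out.append(ch)
    return ''.join(out)

print(repr(fold_case('In THE Beginning')))
```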

You might expect that he would have to do more work, though, since the
case-shift control characters are more things he has to encipher. But
it also allows the Huffman algorithm to use shorter codes for the
letters.  It turns out that these effects cancel out almost exactly,
and the difference on my test file is about 0.08%.  The code I came
out with, after a little bit of manual tweaking, is as follows:

### Optimized binary code for manual encoding ###

    ('0000', 'i')
    ('0001 0', '\n') (line break. Could use for paragraph breaks.)
    ('0001 1000 00', "'")
    ('0001 1000 01', 'z')
    ('0001 1000 10', '?')
    ('0001 1000 110', 'x')
    ('0001 1000 1110', 'q')
    ('0001 1001', 'j')
    ('0001 101', '.')
    ('0001 11', 'y')
    ('001', 'e')
    ('0100', 'n')
    ('0101', 'o')
    ('0110 000', 'v')
    ('0110 001', '\x13')  (switch to numbers: 0 for a, 1 for b, etc.)
    ('0110 01', 'g')
    ('0110 1', 'l')
    ('0111', 'a')
    ('1000 00', 'w')
    ('1000 01', ',')
    ('1000 10', 'c')
    ('1000 11', 'b')
    ('1001', 'h')
    ('1010', 't')
    ('1011 000', '\x11')  (switch to lowercase)
    ('1011 0010 0', '\x12') (switch to uppercase)
    ('1011 0010 1', ';')
    ('1011 0011', 'k')
    ('1011 01', 'm')
    ('1011 1', 'r')
    ('1100 0', 'd')
    ('1100 10', 'u')
    ('1100 11', '\x14') (next character is an uppercase letter)
    ('1101 000', 'p')
    ('1101 001', ':')
    ('1101 01', 'f')
    ('1101 1', 's')
    ('111', ' ')

0001 1000 1111... was a bunch of poorly-represented characters, which
appeared only in the Project Gutenberg blurb.  It remains open for
further expansion with rarely-used characters.

This encodes the test file at 4.45 bits per character, counting the
inserted control characters as characters, or 4.6 bits per character
of the original text if you exclude them from the count.

#### Encoding key ####

    ("'", '0001 1000 00')
    (' ', '111')
    (',', '1000 01')
    ('.', '0001 101')
    (':', '1101 001')
    (';', '1011 0010 1')
    ('?', '0001 1000 10')
    ('\n', '0001 0') (line break. Could use for paragraph breaks.)
    ('\x11', '1011 000')  (switch to lowercase)
    ('\x12', '1011 0010 0') (switch to uppercase)
    ('\x13', '0110 001')  (switch to numbers: 0 for a, 1 for b, etc.)
    ('\x14', '1100 11') (next character is an uppercase letter)
    ('a', '0111')
    ('b', '1000 11')
    ('c', '1000 10')
    ('d', '1100 0')
    ('e', '001')
    ('f', '1101 01')
    ('g', '0110 01')
    ('h', '1001')
    ('i', '0000')
    ('j', '0001 1001')
    ('k', '1011 0011')
    ('l', '0110 1')
    ('m', '1011 01')
    ('n', '0100')
    ('o', '0101')
    ('p', '1101 000')
    ('q', '0001 1000 1110')
    ('r', '1011 1')
    ('s', '1101 1')
    ('t', '1010')
    ('u', '1100 10')
    ('v', '0110 000')
    ('w', '1000 00')
    ('x', '0001 1000 110')
    ('y', '0001 11')
    ('z', '0001 1000 01')
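
Encoding with this key is a simple lookup; the spaces grouping the
code digits into fours are only a reading aid.  For example, with a
small excerpt of the key:

```python
# Encode with an excerpt of the key above; the spaces in the codes
# group bits for human readers and are stripped before output.
KEY = {'t': '1010', 'h': '1001', 'e': '001', ' ': '111',
       'c': '1000 10', 'a': '0111'}

def encode(text):
    return ''.join(KEY[ch].replace(' ', '') for ch in text)

print(encode('the'))  # 10101001001
```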

Other compression algorithms
----------------------------

This is far from the performance of modern compression algorithms:

    |---------------------------------------------------------+-------------|
    | encoding                                                | output bits |
    |---------------------------------------------------------+-------------|
    | ASCII                                                   | 34 814 776  |
    | Bacon (5-bit, letters only)                             | 16 197 215  |
    |---------------------------------------------------------+-------------|
    | simple Huffman with case-shift coding (the table above) | 20 094 861  |
    | simple Huffman                                          | 19 939 830  |
    | context-sensitive Huffman                               | 14 874 433  |
    | simple Huffman (letters only)                           | 13 450 484  |
    | context-sensitive Huffman (letters only)                | 11 339 286  |
    |---------------------------------------------------------+-------------|
    | compress (LZW)                                          | 12 595 576  |
    | gzip -9 (LZ77, Huffman)                                 | 10 968 576  |
    | LZMA                                                    | 8 319 576   |
    | bzip2 -9 (BWT, MTF, Huffman, etc.)                      | 8 022 672   |
    |---------------------------------------------------------+-------------|

However, I don’t believe that any of the more modern algorithms are
practical to carry out by hand --- even LZW, which is roughly as
simple as Huffman coding in software.
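
For comparison, here is a sketch of LZW (integer codes, no
bit-packing).  The software really is about as short as a Huffman
coder, but every step consults a dictionary that grows with the text,
which is what makes it miserable to carry out by hand:

```python
# Minimal LZW: emit the code for the longest known prefix, then add
# that prefix plus one more character to the dictionary.  Codes are
# left as integers; a real implementation would pack them into bits.
def lzw_encode(s):
    table = {chr(i): i for i in range(256)}
    out, buf = [], ''
    for ch in s:
        if buf + ch in table:
            buf += ch
        else:
            out.append(table[buf])
            table[buf + ch] = len(table)
            buf = ch
    if buf:
        out.append(table[buf])
    return out

def lzw_decode(codes):
    table = {i: chr(i) for i in range(256)}
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        entry = table.get(code, prev + prev[0])  # the one tricky case
        out.append(entry)
        table[len(table)] = prev + entry[0]
        prev = entry
    return ''.join(out)
```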

Errors
------

When considering human encoding, it's important to think about the
effect of errors.

Huffman coding differs from fixed-width coding (e.g. ASCII, Baudot) in
its response to errors.  With fixed-width coding, a bit substitution
creates a single-character error, but the rest of the text is
unharmed.  Huffman coding can be desynchronized by a bit substitution
error, but it will eventually resynchronize.  Fixed-width coding, by
contrast, becomes permanently desynchronized by bit insertion and
deletion errors, while Huffman coding will eventually resynchronize in
these cases as well.
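
Resynchronization is easy to demonstrate with a toy code (this code
is made up for the demonstration; it is not one of the tables
computed above):

```python
# Demonstrate resynchronization after a bit error, with a made-up
# prefix-free code rather than one of the tables in this post.
CODE = {'0': 'e', '10': 't', '110': 'a', '111': 'o'}

def encode(s):
    inverse = {v: k for k, v in CODE.items()}
    return ''.join(inverse[ch] for ch in s)

def decode(bits):
    out, buf = [], ''
    for bit in bits:
        buf += bit
        if buf in CODE:
            out.append(CODE[buf])
            buf = ''
    return ''.join(out)

bits = encode('tetotate')    # 1001011110110100
damaged = '0' + bits[1:]     # substitute the first bit
print(decode(damaged))       # eeetotate: garbled start, same tail
```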

Fixed-width coding errors will tend to be fairly obvious; the
substituted letter will usually be an uncommon letter, and one that
wouldn’t make sense in context.  Huffman coding, by contrast, will
tend to prefer common letters, and the context-sensitive Huffman
coding I suggest using here will additionally tend to produce
plausible *sequences* of letters, dissociated-press style, and
occasionally even words.  The superstitious might amuse themselves by
generating random bits from a series of coin flips, decoding the
results, and attempting to interpret them as a supernatural message.

As an example, using the context-sensitive Huffman method, the first
verse of Genesis 'inthebeginninggodcreatedtheheavenandtheearth'
encodes as ‘000101111101111000111101110000101101101100001001010001100’
‘10110010111101101000001100101010110101001100000101111101011110101101’
‘011001000111011101’.  If we change the first bit to from ‘0’ to ‘1’,
we get the result ‘tanthestjendtteldsomenoondspawndsanergtho’, which
consists mostly of words: “tan the stj end tteld some no on d spawn ds
an erg tho”.  If instead of substituting, we prepend a ‘1’, we get
instead ‘omthgsulivesndlndtheatedtheheavenandtheearth’ --- it
resynchronized partway through, giving us “Omthgsu Live Sndlndth eated
the heaven and the earth”, a sort of Lovecraft edit of Genesis.

Using the context-sensitive Huffman tables for the full ASCII version
of the PG KJV, inserting a '1' at the beginning transforms “this is a
simple piece of text.  It should be relatively easy to correct errors
in before too long, given the self synchronizing property of huff man
codes.” into the following:

    nd, msoug thef perunodix
    10 thendethaicowan thth y whersemesof er I the plond
    ofunewin ileathinse,) t th t r
    ed themicut y it ghity chit hronizing property of huff man codes.

I had to search for a while to find an error that left it
desynchronized for such a long time, though.

It's not impossible to construct a text that fails to resynchronize
for an arbitrarily long time; simple repetition will generally do it.

Consequently, although encoding and decoding errors are likely to be
easily recoverable in this scheme, substitution errors will have a
bigger effect than in a fixed-width code, and the erroneous decoded
text will appear much more plausible than with Bacon's less efficient
method.

"""

import re, collections, sys

def huff(freqs):
    "Given a dictionary of letter frequencies, compute a huffman tree for them."
    items = [(v, k) for k, v in freqs.items()]

    while len(items) > 1:
        # find the two smallest items.
        items.sort()      # heapq would be faster, but this is simpler
        ((firstv, firstk), (secondv, secondk)) = items[:2]

        # combine them.
        new_item = (firstv + secondv, (firstk, secondk))
        items[0] = new_item
        items[1] = items[-1]
        items.pop()

    return items[0][1]

def mkcode(tree, prefix=''):
    "Given a huffman tree, generate the code for it."
    if type(tree) is type(()):
        for item in mkcode(tree[0], prefix + '0'): yield item
        for item in mkcode(tree[1], prefix + '1'): yield item
    else:
        yield prefix, tree

def letters_only(infile):
    "Strip out nonletters and case information from a sequence of lines."
    for line in infile:
        yield re.sub('[^a-z]', '', line.lower())

def freq(infile):
    "Tabulate the frequencies of characters in a sequence of lines."
    rv = collections.defaultdict(lambda: 0)
    for line in infile:
        for char in line:
            rv[char] += 1
    return rv

class SimpleEncoder:
    "Encodes according to a simple huffman tree."

    def __init__(self, tree):
        self.dict = dict((letter, code) for code, letter in mkcode(tree))

    def encode(self, instring):
        return ''.join(self.dict[char] for char in instring)

# This is kind of a silly way to do this.  You could simply calculate
# them from the frequency tables used to compute the huffman code in
# the first place.
def evaluate_code(encoder, infile):
    "Compute input and output size for an encoder applied to a file."
    in_chars = 0
    out_chars = 0

    for line in infile:
        in_chars += len(line)
        out_chars += len(encoder.encode(line))

    return in_chars, out_chars

def freq_follow(infile):
    "Tabulate character frequencies for single-character contexts."

    rvs = collections.defaultdict(lambda: collections.defaultdict(lambda: 0))
    prev_char = 'a'

    for line in infile:
        for char in line:
            rvs[prev_char][char] += 1
            prev_char = char

    return rvs

class AdaptiveEncoder:
    "Encode with a different huffman table for each single-character context."

    def __init__(self, freqs):
        trees = [(prev_char, huff(next_char_freqs))
                 for prev_char, next_char_freqs in freqs.items()]

        self.dicts = {}
        self.decode_dicts = {}
        for prev_char, tree in trees:
            self.dicts[prev_char] = dict((letter, code)
                                         for code, letter in mkcode(tree))
            self.decode_dicts[prev_char] = dict(mkcode(tree))

        self.reset()

    def reset(self):
        self.prev_char = 'a'

    def encode(self, instring):
        rv = []
        for char in instring:
            rv.append(self.dicts[self.prev_char][char])
            self.prev_char = char
        return ''.join(rv)

    def decode(self, encoded):
        rv = []
        buf = ''
        bits = iter(encoded)
        while True:
            if buf in self.decode_dicts[self.prev_char]:
                self.prev_char = self.decode_dicts[self.prev_char][buf]
                rv.append(self.prev_char)
                buf = ''
            else:
                try:
                    bit = bits.next()
                except StopIteration:
                    return ''.join(rv)
                buf += bit

def test(input_file):
    "Try different codes on a particular input file."

    print "Single huffman table:"
    freqs = freq(input_file())
    print freqs

    tree = huff(freqs)
    print tree
    for item in mkcode(tree):
        print item

    in_chars, out_chars = evaluate_code(SimpleEncoder(tree), input_file())
    print "%d chars in, %d bits out, %.2f bits/char" % (
           in_chars,    out_chars,   float(out_chars)/in_chars)

    print
    print "Using one-character context:"

    ffs = freq_follow(input_file())
    print ffs

    trees = [(prev_char, huff(next_char_freqs))
             for prev_char, next_char_freqs in ffs.items()]

    trees.sort()
    for prev_char, tree in trees:
        print "following %r:" % prev_char
        for item in mkcode(tree): print item
        print

    in_chars, out_chars = evaluate_code(AdaptiveEncoder(ffs), input_file())
    print "with 1-char adaptive, %d chars in, %d bits out, %.2f bits/char" % (
                                 in_chars, out_chars, float(out_chars)/in_chars)

if __name__ == '__main__':
    print "Letters only:"
    test(lambda: letters_only(file(sys.argv[1])))
    print
    print "All characters:"
    test(lambda: file(sys.argv[1]))
    

# Local Variables:
# compile-command: "./huffman.py"
# End: