Hi All,

Some months ago there was a heated debate where some proponents claimed that "it is impossible to estimate the entropy of an arbitrary message, because we cannot calculate the probability". I still think that claim is mistaken, and I'll demonstrate why using a simple, real-world example.
Consider the following arbitrary message:

    DBDCDBCCCBCDDDADDBCA

In this message there are 4 symbols: A, B, C, and D. Let's count the occurrences of each:

    A: 2
    B: 4
    C: 6
    D: 8

The total number of symbols is 20, so the probabilities of each symbol in this message are:

    A: 2/20 = 0.1
    B: 4/20 = 0.2
    C: 6/20 = 0.3
    D: 8/20 = 0.4

By simply counting the occurrences of each symbol, I have already estimated the probability of each symbol in the message. Now let's calculate the optimal code length of each symbol:

"According to Shannon's source coding theorem, the optimal code length for a symbol is -log_b(P), where b is the number of symbols used to make output codes and P is the probability of the input symbol."
(source: https://en.wikipedia.org/wiki/Entropy_encoding)

So, if we want to encode the above symbols in binary, which has 2 symbols (0 and 1), let's calculate the code lengths using Shannon's formula above:

    A: -log2(0.1) = 3.321928094887362
    B: -log2(0.2) = 2.321928094887362
    C: -log2(0.3) = 1.7369655941662063
    D: -log2(0.4) = 1.3219280948873622

In other words, these are the optimal code lengths for each symbol according to Shannon, so an optimal data compressor will encode them using about this many bits. If we multiply each by the number of occurrences in the message, we get the theoretically "optimal" compressed message size in bits, which is the "entropy" of the message in bits:

    (-log2(0.1)*2) + (-log2(0.2)*4) + (-log2(0.3)*6) + (-log2(0.4)*8) = 36.92878689342031

This message, using optimal compression, will be about 37 bits long, so practically this is the "entropy" of this message. To confirm this, let's compress the above message by creating a Huffman tree, which is an optimal prefix code, in other words, the shortest possible prefix code.
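The counting and Shannon-formula arithmetic above can be sketched in a few lines of Python (this is my illustration of the calculation, not an optimal compressor itself):

```python
# Count symbol occurrences in the message, turn the counts into probabilities,
# and sum -log2(p) * count to get the estimated total entropy in bits.
from collections import Counter
from math import log2

message = "DBDCDBCCCBCDDDADDBCA"
counts = Counter(message)
total = len(message)

entropy_bits = 0.0
for symbol, count in sorted(counts.items()):
    p = count / total
    print(f"{symbol}: count={count}, p={p:.1f}, -log2(p)={-log2(p):.4f}")
    entropy_bits += -log2(p) * count

print(f"Estimated total entropy: {entropy_bits:.4f} bits")  # ~36.9288 bits
```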
It is built by sorting the symbols by number of occurrences (= probabilities):

    D: 8
    C: 6
    B: 4
    A: 2

Then simply build a tree by repeatedly merging the two branches with the smallest weights, until the whole tree is complete (best viewed with a monospaced font):

    D: 8 ----------------------------\__[DCBA]:20__
    C: 6 ---------------\__[CBA]:12__/
    B: 4 ----\__[BA]:6__/
    A: 2 ----/

To get the prefix codes for each symbol, simply traverse the tree from the root, and write down a 0 at every up turn and a 1 at every down turn (or vice versa, it doesn't matter):

    D: 0
    C: 10
    B: 110
    A: 111

That is a possible list of prefix codes for Huffman-encoding the message "DBDCDBCCCBCDDDADDBCA". Notice that the lengths of these codes are within one bit of the theoretical optimal code lengths calculated earlier using Shannon's formula:

    D: length(0)   = 1, -log2(0.4) = 1.3219280948873622
    C: length(10)  = 2, -log2(0.3) = 1.7369655941662063
    B: length(110) = 3, -log2(0.2) = 2.321928094887362
    A: length(111) = 3, -log2(0.1) = 3.321928094887362

Using this prefix code, the message "DBDCDBCCCBCDDDADDBCA" can be encoded in binary as follows:

    0 110 0 10 0 110 10 10 10 110 10 0 0 0 111 0 0 110 10 111

Or, written together:

    01100100110101010110100001110011010111

...which has a length of 38 bits, just above what Shannon's theoretical "optimal code length" formula gave us (~37 bits). Huffman coding provably gives an optimal-length prefix code, so no symbol-by-symbol prefix code will compress the above message into fewer than 38 bits (codes not restricted to a whole number of bits per symbol, such as arithmetic coding, can get closer to the ~36.9-bit bound).

The original message, encoded as 8-bit ASCII, is 8*20 = 160 bits long. Encoded as 5-bit Baudot code, it would be 5*20 = 100 bits long. Encoded as 2-bit binary (00=A, 01=B, 10=C, 11=D), it would be 2*20 = 40 bits long.

    Compared to ASCII:          message size is 23.75% (76.25% compression)
    Compared to Baudot:         message size is 38%    (62% compression)
    Compared to 2-bit encoding: message size is 95%    (5% compression, not counting codebook overhead)
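The merge procedure above can be sketched in Python with a min-heap, which is a common way to implement Huffman's algorithm. Ties between equal weights can be broken either way, so the exact bit patterns may differ from the ones above, but the code lengths, and hence the 38-bit total, come out the same:

```python
# Repeatedly pop the two lowest-weight subtrees, merge them, and prepend a
# '0' to every code in one subtree and a '1' to every code in the other.
import heapq

def huffman_codes(weights):
    """weights: dict symbol -> count. Returns dict symbol -> bit string."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    tick = len(heap)  # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, tick, merged))
        tick += 1
    return heap[0][2]

message = "DBDCDBCCCBCDDDADDBCA"
codes = huffman_codes({"D": 8, "C": 6, "B": 4, "A": 2})
encoded = "".join(codes[s] for s in message)
print(codes)  # code lengths: D=1, C=2, B=3, A=3 bits
print(f"Encoded length: {len(encoded)} bits")  # 38 bits
```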
So in this simple example, I estimated the "optimal code length" (~= "entropy" in bits) of the arbitrary message "DBDCDBCCCBCDDDADDBCA" as the sum of the estimated entropies of the symbols constituting the message, simply by counting their occurrences: ~36.9 bits using Shannon's theoretical optimum formula, or 38 bits using Huffman encoding (a provably optimal prefix code). This simple estimate can be repeated for any arbitrarily chosen message. I created a simple program that does this automatically:

    http://morpheus.spectralhead.com/entropy/entest.tcl.txt

When run, it reads input from standard input until EOF (Control-Z on Windows, Control-D on *nix), then prints a table of the characters in the input, their occurrence count and estimated probability, and their estimated entropy in bits (using Shannon's formula). Finally, it prints the estimated total entropy in bits (and, for large inputs, in kilobytes as well). Optionally, a file name to be processed can be given as an argument. In other words, this program calculates

    SUM [i=1..N] -log2(count_i/T) * count_i

...where T is the total number of symbols (symbols assumed to be individual characters), count_i is the number of occurrences of the ith symbol, and count_i/T is the estimated probability of that symbol in the message. So, when run on the message "DBDCDBCCCBCDDDADDBCA", it will print this:

    Symbol Count Probab. Entropy
    -------------------------------
    D:     8     p=0.400, ent=1.322
    C:     6     p=0.300, ent=1.737
    B:     4     p=0.200, ent=2.322
    A:     2     p=0.100, ent=3.322

    Estimated total entropy: 36.9 bits

When testing this program, I found that it gives a rough estimate of how well a file can be compressed with, for example, PKZIP. Usually, compression algorithms will produce a smaller file, as they use dictionary-based compression instead of just encoding every character based on its probability by occurrence count.
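For readers who don't speak Tcl, here is a rough Python sketch of the same counting formula (my illustration of the idea, not the linked script itself):

```python
# Estimate total entropy of a byte stream as SUM -log2(count_i / T) * count_i,
# where symbols are individual bytes.
import sys
from collections import Counter
from math import log2

def estimate_entropy_bits(data: bytes) -> float:
    counts = Counter(data)
    total = len(data)
    return sum(-log2(c / total) * c for c in counts.values())

def main():
    # Read a file given as an argument, otherwise standard input until EOF.
    if len(sys.argv) > 1:
        data = open(sys.argv[1], "rb").read()
    else:
        data = sys.stdin.buffer.read()
    bits = estimate_entropy_bits(data)
    print(f"Estimated total entropy: {bits:.1f} bits ({bits / 8192:.2f} KiB)")

# Quick check on the example message instead of a file:
print(estimate_entropy_bits(b"DBDCDBCCCBCDDDADDBCA"))  # ~36.93
```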
Since I was curious how well this formula estimates the compressibility of arbitrary data (which inversely correlates with entropy), I downloaded about a gigabyte of various text files from the textfiles.com archive (a total of 44,618 files) and ran this estimator on each of them (the program ran for hours). Here are the raw results [2.7 MB]:

    http://morpheus.spectralhead.com/entropy/entest-results.txt

I found that on average, the error between the entropy estimate and the PKZIP-compressed file size is about 12%. Since I was curious about the distribution of errors, I made a fancy graph of it:

    http://morpheus.spectralhead.com/entropy/errordist/

This shows that by simply counting the number of occurrences of each character in the message and using Shannon's formula, I can already estimate the compressibility of arbitrary text with an average 12% error, with an error distribution following a somewhat Gaussian-like curve around that. Interestingly, there's a very distinct "peak" at the 0% band, meaning the cases where this algorithm estimated the compressed file size "just about right", within 1% error. This happened with distinctly larger probability than at nearby values, and maybe corresponds to the cases where dictionary-based compression couldn't be used effectively (but this is just a guess). Nevertheless, this distinct peak at 0% is very interesting. From this graph, we could also say that PKZIP's dictionary-based compression algorithm is, on average, 12% better than simply encoding the individual characters based on their estimated probability using Huffman coding.
If we assume that on average the PKZIP-compressed file is 12% smaller than the result of this algorithm, and we factor that in, then using this simple character-counting algorithm we could estimate the PKZIP-compressed file size within 1% error with 16.7% probability, within 5% error with 52.5% probability, and within 10% error with 79.4% probability (these figures come from the cumulative error distribution). Although PKZIP's DEFLATE compression algorithm is not a precise indicator of entropy, it uses the LZ77 algorithm together with Huffman coding (which is proven to be an optimal prefix code). Peter Shor has shown that "in the case of random strings, Lempel Ziv [LZ77] algorithm's compression rate asymptotically approaches the entropy". For more info on this, please refer to http://www-math.mit.edu/~shor/PAM/lempel_ziv_notes.pdf. So we can assume that PKZIP's compression rate is an approximator of entropy. If you believe this to be false, please discuss it with Mr. Shor; you'll find him at the applied mathematics department of MIT.

So, with this simple formula, I could give a crude approximation of the entropy of an arbitrary message without analyzing any symbol correlations, and I also gave the expected error distribution compared to dictionary-based LZ77/Huffman encoding. The numbers do not lie.

If you - for whatever religious reason - insist on the limited, dumbed-down view that only the probability of whole messages is to be analyzed, then consider this: instead of sending "DBDCDBCCCBCDDDADDBCA" as a whole message, consider what happens when you send each symbol successively as a separate message. So the first message is "D", the second message is "B", the third is "D", the fourth is "C", then "D", "B", "C"... and so on. (After all, it doesn't matter whether I send you the message as a whole or divide it into several messages; after recombining the parts, you will get the same message.)
In this case, if you analyze messages in a 20-message-long window, then you'll reach the following conclusion:

    - probability of message "A" was 0.1 (because 2 out of 20 messages were "A")
    - probability of message "B" was 0.2 (because 4 out of 20 messages were "B")
    - probability of message "C" was 0.3 (because 6 out of 20 messages were "C")
    - probability of message "D" was 0.4 (because 8 out of 20 messages were "D")

When compressing some kind of data, it does not matter what happened 1 million years in the past, or what will happen 1 million years in the future. The only thing that matters from the point of view of data compression is the probability of symbols in the given frame or window being analyzed. Which, in the case of Huffman encoding (which gives an optimal prefix code), is simply the number of occurrences of a given symbol in a certain window. In other words, it's simply a counter. It doesn't even normalize the probability to the 0-1 range, as that's entirely unnecessary; rather, it just uses the numbers of occurrences as-is, usually called 'weights' in the resulting weighted binary tree, as I showed in the simple example above.

Saying "you cannot calculate the probability of an arbitrary message, because you cannot see infinitely into the past and future" is analogous to saying "you cannot reconstruct a digitally encoded analog signal, because you would need an infinitely long sinc filter kernel". Well... strictly speaking, both are true, but if people actually believed this, then we would have neither sound cards nor data compression. If you have ever in your lifetime heard about windowing, then you will know that sinc kernels are truncated to a certain window length, and data compression works by analyzing probability only within a window of a given length (for example, LZ77 uses a sliding window when compressing data). Simply counting the number of occurrences of symbols in a message can already give us an 'optimal prefix code', which Dr. David A.
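The windowed counting described above is trivial to sketch. Here is a toy Python illustration (the function name is mine) that estimates probabilities from only the last 20 symbols seen, the way sliding-window compressors restrict their statistics to a recent frame:

```python
# Estimate symbol probabilities from a bounded window over the stream:
# a deque with maxlen automatically discards symbols older than the window.
from collections import Counter, deque

def windowed_probs(stream, window=20):
    buf = deque(maxlen=window)
    for sym in stream:
        buf.append(sym)
    counts = Counter(buf)
    return {s: c / len(buf) for s, c in counts.items()}

probs = windowed_probs("DBDCDBCCCBCDDDADDBCA", window=20)
print(probs)  # {'D': 0.4, 'B': 0.2, 'C': 0.3, 'A': 0.1}
```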
Huffman proved formally, outperforming the Shannon-Fano encoding that existed at the time, invented by Claude Shannon and Robert Fano, who were colleagues at MIT. Dr. Huffman (who was a student of Fano) created, for a term paper in Fano's information theory class, an encoding that gives better results than the Shannon-Fano method, and proved it to be the optimal prefix code for encoding symbols with given probabilities. This is what we call Huffman encoding.

Also, if I know the language of the message, that already gives me a lot of statistical information (and determining the language itself can be done by statistical analysis). For example, if I know that the message you're sending me is in English (very high probability, as it makes no sense to send me messages in languages I do not understand), then I already know that the most common symbol will be the letter 'e', which thus has the highest probability and the lowest entropy. I'll also know that the second most common symbol will be 't', the third most common will be 'a', and so on. By statistically analyzing a large amount of English text, I can extract a lot of probability information; for example, I can tell the probabilities of each letter, digraph, trigraph, word, two-word combination, three-word combination, letter at word start, letter at word end, etc. Thus I can tell the average entropy of each symbol, letter, digraph, etc. And I'm not making this up: Shannon deals with this in "A Mathematical Theory of Communication". If I am not mistaken, this letter -> binary prefix code mapping comes from Dr.
Huffman:

    E->100        T->001        A->1111       O->1110       N->1100
    R->1011       I->1010       S->0110       H->0101       D->11011
    L->01111      F->01001      C->01000      M->00011      U->00010
    G->00001      Y->000000     P->110101     W->011101     B->011100
    V->1101001    K->110100011  X->110100001  J->110100000
    Q->1101000101 Z->1101000100

It is based on the simple observation that some letters are more common (= have higher probability) than others (the letter "e" being the most common), thus they have less entropy, and hence the binary prefix codes for them are shorter. So this mapping is statistically "optimal" for encoding large amounts of English text (= it produces the shortest possible encoding). So if you send me arbitrary messages in English, I can already tell the statistically expected probabilities of the symbols (because I can do statistical analysis on pre-existing English texts), and hence the "expected" entropy of each symbol or message, the latter being the sum of the expected entropies of the individual symbols constituting the message. Single messages will have some variance, but if I process a lot of messages, their statistical average will converge to the statistically expected entropy of English texts in general. The more text I process, the more it converges. So a _very_ simple entropy estimate could be simply multiplying the average expected entropy per character in that language by the length of the message, giving a very crude entropy estimate of an arbitrary message.

If "theory" books teach you that the only thing to analyze is the probability of whole messages, then I recommend throwing those books far away. If you believe that Shannon, Fano, Huffman and others only analyzed whole messages and not individual symbols, then you're very mistaken - all of these folks analyzed the probability and information content of individual symbols in messages. So I'm not sure who came up with the notion that 'entropy' is only defined based on the probability of a message as a whole, but his books should be burned.
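As a sketch of that crude "expected entropy" estimate: given per-letter probabilities gathered from a large corpus, the expected entropy of a message is roughly the sum of -log2(p) over its letters. The three frequencies below are illustrative approximations for English from commonly cited tables, and the 1% default for unlisted characters is purely my assumption:

```python
# Expected entropy of a message from assumed per-letter probabilities.
# english_freq values are illustrative (e ~12.7%, t ~9.1%, a ~8.2%),
# not measured by me; fill in a full table from a real corpus in practice.
from math import log2

english_freq = {"e": 0.127, "t": 0.091, "a": 0.082}  # ...and so on

def expected_bits(message, freq, default=0.01):
    # Unlisted letters get a small illustrative default probability.
    return sum(-log2(freq.get(ch, default)) for ch in message.lower())

print(f"{expected_bits('teatime', english_freq):.1f} bits")
```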
Rather, try to implement some data or waveform compression; that will teach you a lot more than some dumb books. Or, if you insist on learning by reading instead of learning by doing, then at least read the works of Shannon, Huffman etc. instead of some dumbed-down material, as that might lead you down misleading paths. Example: implementing a Huffman tree teaches you that you can easily approximate the probability of each symbol in an arbitrary message by counting the number of occurrences, and by using those counts as weights in a binary tree, you can simply create an optimal prefix code, which is the shortest possible code to represent symbols with those probabilities. Thus, its length estimates the "entropy" of the message consisting of those symbols, in bits. This works for an arbitrary message.

Sure, there are other estimates, not just ones based on individual characters; if the message is longer, then individual words could also be considered 'symbols', or certain combinations of letters can be considered 'symbols' (as in the case of dictionary-based encodings). Analyzing characters is just a very simple approximation, and it doesn't analyze the correlation between the symbols, or use dictionaries or sophisticated pattern-matching algorithms. Yet it still approximates the LZ77/Huffman-compressed file size with an average 12% error in my tests, despite its simplicity and not analyzing any correlations at all. The probability/entropy can be analyzed on multiple levels - on the level of whole messages, on the level of words, on the level of characters or character combinations, on the level of bits, etc., depending on how you break down a message into individual symbols. But why analyze only messages as a whole? That'd be a dumb idea. If a message has entropy, that entropy has to be contained in the individual symbols constituting the message, so you can break it down further.

Some entropy encodings do not care about the actual probabilities.
Huffman encoding, Shannon-Fano encoding, Shannon-Fano-Elias encoding, and arithmetic encoding create codes based on the (measured or estimated) probabilities of the symbols. Other entropy encodings, like Golomb coding, Rice coding, Elias coding, unary coding etc., just assume that the probability distribution approximates some given formula; for example, unary coding is the optimal code when the probability distribution is geometric, like P = 2^(-N). Incidentally, Huffman coding - when applied to this geometric probability distribution - gives precisely the unary encoding (which is also the same as Golomb coding with M=1, i.e. Rice coding with parameter k=0), thus showing that unary coding is optimal for this geometric distribution. Here is a Huffman tree for an example geometric distribution:

    1: 128-----------------------------\_255_
    2: 64------------------------\_127_/
    3: 32-------------------\_63_/
    4: 16--------------\_31_/
    5: 8----------\_15_/
    6: 4------\_7_/
    7: 2--\_3_/
    8: 1--/

The resulting optimal prefix code is the same as the unary code (except for the last one):

    1: 0
    2: 10
    3: 110
    4: 1110
    5: 11110
    6: 111110
    7: 1111110
    8: 1111111

Do not get confused when you hear the term 'entropy encoding' - it is simply a fancy name for 'information encoding'. And since all information is numbers, or can be represented by numbers, all forms of number encodings are - by definition - forms of 'entropy encoding', which practically just means 'number encoding'. For example, if you simply write down N ones followed by a zero to encode N, that is also an 'entropy encoding' - precisely the so-called unary encoding, which is the same as the output of Huffman encoding applied to a geometric distribution, and therefore the optimal encoding for a geometric probability distribution (regardless of the actual symbols or message). Whenever you encode a 'message', you're actually encoding 'numbers'.
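This claim is easy to check mechanically: running Huffman's merge procedure on the geometric weights 128, 64, ..., 2, 1 yields the unary code lengths 1, 2, ..., 7, 7 (the last two codes share length 7). A small Python sketch, mine, for illustration:

```python
# Compute Huffman code lengths only: each time a symbol's subtree is merged,
# its eventual code grows by one bit.
import heapq

def huffman_lengths(weights):
    """Return the code length per symbol index for the given weight list."""
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    lengths = [0] * len(weights)
    heapq.heapify(heap)
    tick = len(heap)  # unique tie-breaker for equal weights
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # every merge adds one bit to each member's code
        heapq.heappush(heap, (w1 + w2, tick, s1 + s2))
        tick += 1
    return lengths

print(huffman_lengths([128, 64, 32, 16, 8, 4, 2, 1]))  # [1, 2, 3, 4, 5, 6, 7, 7]
```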
In many cases it can be known statistically that the symbols (= numbers) will approximately follow a certain probability distribution, regardless of the message or signal. For example, many lossless wave compression codecs try to 'predict' the next sample from the preceding samples, and just encode the 'error' between the actual sample and the prediction. This way, it can be expected statistically that the 'error' will predominantly be a small number, regardless of what sample you actually encode, so a form of encoding optimized for such probabilities, like Rice coding, will give good results without measuring the actual symbol probabilities at all. The advantage of simply assuming or implying some probability distribution based on statistical data is that you do not need to transmit the symbol/probability pairs (the "codebook") used to construct the Huffman tree; therefore it can result in actual savings compared to entropy encodings that actually measure the probabilities, like Huffman encoding, since transmitting the Huffman tree also creates overhead.

Some naive readers might think that this topic has nothing to do with *music* DSP, but that is a false belief - entropy encoding is *very* relevant to musical DSP, as many audio compression algorithms employ these entropy encodings when working with audio data. So we could say these are "advanced" DSP topics, implemented in certain audio codecs. Using some simple bit tricks and a new entropy encoding technique, I created a practical, lossless waveform compressor with ratios comparable to those of FLAC, and - contrary to most encoders - I'm not using any predictor, which means there is probably room for further improvement. Hopefully I can put this algorithm online soon (and ideally, a paper would also be nice, if I find time for it).
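As an illustration of the kind of coder meant here, a minimal sketch of Rice coding: the quotient goes out in unary, the remainder in k fixed bits, so small prediction residuals get short codes (function names are mine; real codecs also map signed residuals to non-negative integers first):

```python
# Rice code for a non-negative integer n with parameter k (divisor 2**k):
# unary-coded quotient, '0' terminator, then k-bit binary remainder.
def rice_encode(n, k):
    q, r = n >> k, n & ((1 << k) - 1)
    rem = format(r, f"0{k}b") if k else ""
    return "1" * q + "0" + rem

def rice_decode(bits, k):
    q = bits.index("0")                    # count leading 1s (unary quotient)
    r = int(bits[q + 1:q + 1 + k] or "0", 2)
    return (q << k) | r

for n in [0, 1, 5, 17]:
    code = rice_encode(n, k=2)
    print(f"{n:2d} -> {code}")
    assert rice_decode(code, k=2) == n     # round-trips correctly
```

Note that with k=0 this degenerates to plain unary coding, matching the Huffman-on-geometric-weights result above.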
Best regards,
Peter Schoffhauzer

--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp