Hi All,

Some months ago there was a heated debate where some proponents claimed that "it is impossible to estimate the entropy of an arbitrary message, because we cannot calculate the probability". I still think that claim is mistaken, and I'll demonstrate why using a simple, real-world example.
Consider the following arbitrary message:

    DBDCDBCCCBCDDDADDBCA

In this message there are 4 symbols: A, B, C, and D. Let's count the occurrences of each:

    A: 2
    B: 4
    C: 6
    D: 8

The total number of symbols is 20, so the probabilities of each symbol in this message are:

    A: 2/20 = 0.1
    B: 4/20 = 0.2
    C: 6/20 = 0.3
    D: 8/20 = 0.4

By simply counting the occurrences of each symbol, I have already estimated the probability of each symbol in the message. Now let's calculate the optimal code length of each symbol:

"According to Shannon's source coding theorem, the optimal code length for a symbol is -log_b(P), where b is the number of symbols used to make output codes and P is the probability of the input symbol."
(source: https://en.wikipedia.org/wiki/Entropy_encoding)

So, if we want to encode the above symbols in binary, which has 2 symbols (0 and 1), let's calculate the code lengths using Shannon's formula above:

    A: -log2(0.1) = 3.321928094887362
    B: -log2(0.2) = 2.321928094887362
    C: -log2(0.3) = 1.7369655941662063
    D: -log2(0.4) = 1.3219280948873622

In other words, these are the optimal code lengths for each symbol according to Shannon, so an optimal data compressor will encode them using about this many bits. If we multiply each by the number of occurrences in the message, we get the theoretically "optimal" compressed message size in bits, which is the "entropy" of the message in bits:

    (-log2(0.1)*2) + (-log2(0.2)*4) + (-log2(0.3)*6) + (-log2(0.4)*8) = 36.92878689342031

This message, using optimal compression, will be about 37 bits long, so practically this is the "entropy" of this message. To confirm this, let's compress the above message by creating a Huffman tree, which is an optimal prefix code, in other words, the shortest possible prefix code.
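The counting and Shannon-formula arithmetic above can be sketched in a few lines of Python (this is my illustration of the calculation, not an optimal compressor itself):

```python
# Count symbol occurrences in the message, turn the counts into probabilities,
# and sum -log2(p) * count to get the estimated total entropy in bits.
from collections import Counter
from math import log2

message = "DBDCDBCCCBCDDDADDBCA"
counts = Counter(message)
total = len(message)

entropy_bits = 0.0
for symbol, count in sorted(counts.items()):
    p = count / total
    print(f"{symbol}: count={count}, p={p:.1f}, -log2(p)={-log2(p):.4f}")
    entropy_bits += -log2(p) * count

print(f"Estimated total entropy: {entropy_bits:.4f} bits")  # ~36.9288 bits
```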
It is built by sorting the symbols by number of occurrences (= probabilities):

    D: 8
    C: 6
    B: 4
    A: 2

Then simply build a tree by repeatedly merging the two branches with the smallest weights, until the whole tree is complete (best viewed with a monospaced font):

    D: 8 ----------------------------\__[DCBA]:20__
    C: 6 ---------------\__[CBA]:12__/
    B: 4 ----\__[BA]:6__/
    A: 2 ----/

To get the prefix codes for each symbol, simply traverse the tree from the root, and write down a 0 at every up turn and a 1 at every down turn (or vice versa, it doesn't matter):

    D: 0
    C: 10
    B: 110
    A: 111

That is a possible list of prefix codes for Huffman-encoding the message "DBDCDBCCCBCDDDADDBCA". Notice that the lengths of these codes are within one bit of the theoretical optimal code lengths calculated earlier using Shannon's formula:

    D: length(0)   = 1, -log2(0.4) = 1.3219280948873622
    C: length(10)  = 2, -log2(0.3) = 1.7369655941662063
    B: length(110) = 3, -log2(0.2) = 2.321928094887362
    A: length(111) = 3, -log2(0.1) = 3.321928094887362

Using this prefix code, the message "DBDCDBCCCBCDDDADDBCA" can be encoded in binary as follows:

    0 110 0 10 0 110 10 10 10 110 10 0 0 0 111 0 0 110 10 111

Or, written together:

    01100100110101010110100001110011010111

...which has a length of 38 bits, just above what Shannon's theoretical "optimal code length" formula gave us (~37 bits). Huffman coding provably gives an optimal-length prefix code, so no symbol-by-symbol prefix code will compress the above message into fewer than 38 bits (codes not restricted to a whole number of bits per symbol, such as arithmetic coding, can get closer to the ~36.9-bit bound).

The original message, encoded as 8-bit ASCII, is 8*20 = 160 bits long. Encoded as 5-bit Baudot code, it would be 5*20 = 100 bits long. Encoded as 2-bit binary (00=A, 01=B, 10=C, 11=D), it would be 2*20 = 40 bits long.

    Compared to ASCII:          message size is 23.75% (76.25% compression)
    Compared to Baudot:         message size is 38%    (62% compression)
    Compared to 2-bit encoding: message size is 95%    (5% compression, not counting codebook overhead)
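The merge procedure above can be sketched in Python with a min-heap, which is a common way to implement Huffman's algorithm. Ties between equal weights can be broken either way, so the exact bit patterns may differ from the ones above, but the code lengths, and hence the 38-bit total, come out the same:

```python
# Repeatedly pop the two lowest-weight subtrees, merge them, and prepend a
# '0' to every code in one subtree and a '1' to every code in the other.
import heapq

def huffman_codes(weights):
    """weights: dict symbol -> count. Returns dict symbol -> bit string."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    tick = len(heap)  # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, tick, merged))
        tick += 1
    return heap[0][2]

message = "DBDCDBCCCBCDDDADDBCA"
codes = huffman_codes({"D": 8, "C": 6, "B": 4, "A": 2})
encoded = "".join(codes[s] for s in message)
print(codes)  # code lengths: D=1, C=2, B=3, A=3 bits
print(f"Encoded length: {len(encoded)} bits")  # 38 bits
```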
So in this simple example, I estimated the "optimal code length" (~= "entropy" in bits) of the arbitrary message "DBDCDBCCCBCDDDADDBCA" as the sum of the estimated entropies of the symbols constituting the message, simply by counting their occurrences: ~36.9 bits using Shannon's theoretical optimum formula, or 38 bits using Huffman encoding (a provably optimal prefix code). This simple estimate can be repeated for any arbitrarily chosen message. I created a simple program that does this automatically:

    http://morpheus.spectralhead.com/entropy/entest.tcl.txt

When run, it reads input from standard input until EOF (Control-Z on Windows, Control-D on *nix), then prints a table of the characters in the input, their occurrence count and estimated probability, and their estimated entropy in bits (using Shannon's formula). Finally, it prints the estimated total entropy in bits (and, for large inputs, in kilobytes as well). Optionally, a file name to be processed can be given as an argument. In other words, this program calculates

    SUM [i=1..N] -log2(count_i/T) * count_i

...where T is the total number of symbols (symbols assumed to be individual characters), count_i is the number of occurrences of the ith symbol, and count_i/T is the estimated probability of that symbol in the message. So, when run on the message "DBDCDBCCCBCDDDADDBCA", it will print this:

    Symbol Count Probab. Entropy
    -------------------------------
    D:     8     p=0.400, ent=1.322
    C:     6     p=0.300, ent=1.737
    B:     4     p=0.200, ent=2.322
    A:     2     p=0.100, ent=3.322

    Estimated total entropy: 36.9 bits

When testing this program, I found that it gives a rough estimate of how well a file can be compressed with, for example, PKZIP. Usually, compression algorithms will produce a smaller file, as they use dictionary-based compression instead of just encoding every character based on its probability by occurrence count.
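For readers who don't speak Tcl, here is a rough Python sketch of the same counting formula (my illustration of the idea, not the linked script itself):

```python
# Estimate total entropy of a byte stream as SUM -log2(count_i / T) * count_i,
# where symbols are individual bytes.
import sys
from collections import Counter
from math import log2

def estimate_entropy_bits(data: bytes) -> float:
    counts = Counter(data)
    total = len(data)
    return sum(-log2(c / total) * c for c in counts.values())

def main():
    # Read a file given as an argument, otherwise standard input until EOF.
    if len(sys.argv) > 1:
        data = open(sys.argv[1], "rb").read()
    else:
        data = sys.stdin.buffer.read()
    bits = estimate_entropy_bits(data)
    print(f"Estimated total entropy: {bits:.1f} bits ({bits / 8192:.2f} KiB)")

# Quick check on the example message instead of a file:
print(estimate_entropy_bits(b"DBDCDBCCCBCDDDADDBCA"))  # ~36.93
```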
Since I was curious how well this formula estimates the compressibility of arbitrary data (which inversely correlates with entropy), I downloaded about a gigabyte of various text files from the textfiles.com archive (a total of 44,618 files) and ran this estimator on each of them (the program ran for hours). Here are the raw results [2.7 MB]:

    http://morpheus.spectralhead.com/entropy/entest-results.txt

I found that on average, the error between the entropy estimate and the PKZIP-compressed file size is about 12%. Since I was curious about the distribution of errors, I made a fancy graph of it:

    http://morpheus.spectralhead.com/entropy/errordist/

This shows that by simply counting the number of occurrences of each character in the message and using Shannon's formula, I can already estimate the compressibility of arbitrary text with an average 12% error, with an error distribution following a somewhat Gaussian-like curve around that. Interestingly, there's a very distinct "peak" at the 0% band, meaning the cases where this algorithm estimated the compressed file size "just about right", within 1% error. This happened with distinctly larger probability than at nearby values, and maybe corresponds to the cases where dictionary-based compression couldn't be used effectively (but this is just a guess). Nevertheless, this distinct peak at 0% is very interesting. From this graph, we could also say that PKZIP's dictionary-based compression algorithm is, on average, 12% better than simply encoding the individual characters based on their estimated probability using Huffman coding.
If we assume that on average the PKZIP-compressed file is 12% smaller than the result of this algorithm, and we factor that in, then using this simple character-counting algorithm we could estimate the PKZIP-compressed file size within 1% error with 16.7% probability, within 5% error with 52.5% probability, and within 10% error with 79.4% probability (these figures come from the cumulative error distribution). Although PKZIP's DEFLATE compression algorithm is not a precise indicator of entropy, it uses the LZ77 algorithm together with Huffman coding (which is proven to be an optimal prefix code). Peter Shor has shown that "in the case of random strings, Lempel Ziv [LZ77] algorithm's compression rate asymptotically approaches the entropy". For more info on this, please refer to http://www-math.mit.edu/~shor/PAM/lempel_ziv_notes.pdf. So we can assume that PKZIP's compression rate is an approximator of entropy. If you believe this to be false, please discuss it with Mr. Shor; you'll find him at the applied mathematics department of MIT.

So, with this simple formula, I could give a crude approximation of the entropy of an arbitrary message without analyzing any symbol correlations, and I also gave the expected error distribution compared to dictionary-based LZ77/Huffman encoding. The numbers do not lie.

If you - for whatever religious reason - insist on the limited, dumbed-down view that only the probability of whole messages is to be analyzed, then consider this: instead of sending "DBDCDBCCCBCDDDADDBCA" as a whole message, consider what happens when you send each symbol successively as a separate message. So the first message is "D", the second message is "B", the third is "D", the fourth is "C", then "D", "B", "C"... and so on. (After all, it doesn't matter whether I send you the message as a whole or divide it into several messages; after recombining the parts, you will get the same message.)
In this case, if you analyze messages in a 20-message-long window, then you'll reach the following conclusion:

    - probability of message "A" was 0.1 (because 2 out of 20 messages were "A")
    - probability of message "B" was 0.2 (because 4 out of 20 messages were "B")
    - probability of message "C" was 0.3 (because 6 out of 20 messages were "C")
    - probability of message "D" was 0.4 (because 8 out of 20 messages were "D")

When compressing some kind of data, it does not matter what happened 1 million years in the past, or what will happen 1 million years in the future. The only thing that matters from the point of view of data compression is the probability of symbols in the given frame or window being analyzed. Which, in the case of Huffman encoding (which gives an optimal prefix code), is simply the number of occurrences of a given symbol in a certain window. In other words, it's simply a counter. It doesn't even normalize the probability to the 0-1 range, as that's entirely unnecessary; rather, it just uses the numbers of occurrences as-is, usually called 'weights' in the resulting weighted binary tree, as I showed in the simple example above.

Saying "you cannot calculate the probability of an arbitrary message, because you cannot see infinitely into the past and future" is analogous to saying "you cannot reconstruct a digitally encoded analog signal, because you would need an infinitely long sinc filter kernel". Well... strictly speaking, both are true, but if people actually believed this, then we would have neither sound cards nor data compression. If you have ever in your lifetime heard about windowing, then you will know that sinc kernels are truncated to a certain window length, and data compression works by analyzing probability only within a window of a given length (for example, LZ77 uses a sliding window when compressing data). Simply counting the number of occurrences of symbols in a message can already give us an 'optimal prefix code', which Dr. David A.
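The windowed counting described above is trivial to sketch. Here is a toy Python illustration (the function name is mine) that estimates probabilities from only the last 20 symbols seen, the way sliding-window compressors restrict their statistics to a recent frame:

```python
# Estimate symbol probabilities from a bounded window over the stream:
# a deque with maxlen automatically discards symbols older than the window.
from collections import Counter, deque

def windowed_probs(stream, window=20):
    buf = deque(maxlen=window)
    for sym in stream:
        buf.append(sym)
    counts = Counter(buf)
    return {s: c / len(buf) for s, c in counts.items()}

probs = windowed_probs("DBDCDBCCCBCDDDADDBCA", window=20)
print(probs)  # {'D': 0.4, 'B': 0.2, 'C': 0.3, 'A': 0.1}
```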
Huffman proved formally, outperforming the Shannon-Fano encoding that existed at the time, invented by Claude Shannon and Robert Fano, who were colleagues at MIT. Dr. Huffman (who was a student of Fano) created, for a term paper in Fano's information theory class, an encoding that gives better results than the Shannon-Fano method, and proved it to be the optimal prefix code for encoding symbols with given probabilities. This is what we call Huffman encoding.

Also, if I know the language of the message, that already gives me a lot of statistical information (and determining the language itself can be done by statistical analysis). For example, if I know that the message you're sending me is in English (very high probability, as it makes no sense to send me messages in languages I do not understand), then I already know that the most common symbol will be the letter 'e', which thus has the highest probability and the lowest entropy. I'll also know that the second most common symbol will be 't', the third most common will be 'a', and so on. By statistically analyzing a large amount of English text, I can extract a lot of probability information; for example, I can tell the probabilities of each letter, digraph, trigraph, word, two-word combination, three-word combination, letter at word start, letter at word end, etc. Thus I can tell the average entropy of each symbol, letter, digraph, etc. And I'm not making this up: Shannon deals with this in "A Mathematical Theory of Communication". If I am not mistaken, this letter -> binary prefix code mapping comes from Dr.
Huffman:

    E->100        T->001        A->1111       O->1110       N->1100
    R->1011       I->1010       S->0110       H->0101       D->11011
    L->01111      F->01001      C->01000      M->00011      U->00010
    G->00001      Y->000000     P->110101     W->011101     B->011100
    V->1101001    K->110100011  X->110100001  J->110100000
    Q->1101000101 Z->1101000100

It is based on the simple observation that some letters are more common (= have higher probability) than others (the letter "e" being the most common), thus they have less entropy, and hence the binary prefix codes for them are shorter. So this mapping is statistically "optimal" for encoding large amounts of English text (= it produces the shortest possible encoding). So if you send me arbitrary messages in English, I can already tell the statistically expected probabilities of the symbols (because I can do statistical analysis on pre-existing English texts), and hence the "expected" entropy of each symbol or message, the latter being the sum of the expected entropies of the individual symbols constituting the message. Single messages will have some variance, but if I process a lot of messages, their statistical average will converge to the statistically expected entropy of English texts in general. The more text I process, the more it converges. So a _very_ simple entropy estimate could be simply multiplying the average expected entropy per character in that language by the length of the message, giving a very crude entropy estimate of an arbitrary message.

If "theory" books teach you that the only thing to analyze is the probability of whole messages, then I recommend throwing those books far away. If you believe that Shannon, Fano, Huffman and others only analyzed whole messages and not individual symbols, then you're very mistaken - all of these folks analyzed the probability and information content of individual symbols in messages. So I'm not sure who came up with the notion that 'entropy' is only defined based on the probability of a message as a whole, but his books should be burned.
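As a sketch of that crude "expected entropy" estimate: given per-letter probabilities gathered from a large corpus, the expected entropy of a message is roughly the sum of -log2(p) over its letters. The three frequencies below are illustrative approximations for English from commonly cited tables, and the 1% default for unlisted characters is purely my assumption:

```python
# Expected entropy of a message from assumed per-letter probabilities.
# english_freq values are illustrative (e ~12.7%, t ~9.1%, a ~8.2%),
# not measured by me; fill in a full table from a real corpus in practice.
from math import log2

english_freq = {"e": 0.127, "t": 0.091, "a": 0.082}  # ...and so on

def expected_bits(message, freq, default=0.01):
    # Unlisted letters get a small illustrative default probability.
    return sum(-log2(freq.get(ch, default)) for ch in message.lower())

print(f"{expected_bits('teatime', english_freq):.1f} bits")
```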
Rather, try to implement some data or waveform compression; that will teach you a lot more than some dumb books. Or, if you insist on learning by reading instead of learning by doing, then at least read the works of Shannon, Huffman etc. instead of some dumbed-down material, as that might lead you down misleading paths. Example: implementing a Huffman tree teaches you that you can easily approximate the probability of each symbol in an arbitrary message by counting the number of occurrences, and by using those counts as weights in a binary tree, you can simply create an optimal prefix code, which is the shortest possible code to represent symbols with those probabilities. Thus, its length estimates the "entropy" of the message consisting of those symbols, in bits. This works for an arbitrary message.

Sure, there are other estimates, not just ones based on individual characters; if the message is longer, then individual words could also be considered 'symbols', or certain combinations of letters can be considered 'symbols' (as in the case of dictionary-based encodings). Analyzing characters is just a very simple approximation, and it doesn't analyze the correlation between the symbols, or use dictionaries or sophisticated pattern-matching algorithms. Yet it still approximates the LZ77/Huffman-compressed file size with an average 12% error in my tests, despite its simplicity and not analyzing any correlations at all. The probability/entropy can be analyzed on multiple levels - on the level of whole messages, on the level of words, on the level of characters or character combinations, on the level of bits, etc., depending on how you break down a message into individual symbols. But why analyze only messages as a whole? That'd be a dumb idea. If a message has entropy, that entropy has to be contained in the individual symbols constituting the message, so you can break it down further.

Some entropy encodings do not care about the actual probabilities.
Huffman encoding, Shannon-Fano encoding, Shannon-Fano-Elias encoding, and arithmetic encoding create codes based on the (measured or estimated) probabilities of the symbols. Other entropy encodings, like Golomb coding, Rice coding, Elias coding, unary coding etc., just assume that the probability distribution approximates some given formula; for example, unary coding is the optimal code when the probability distribution is geometric, like P = 2^(-N). Incidentally, Huffman coding - when applied to this geometric probability distribution - gives precisely the unary encoding (which is also the same as Golomb coding with M=1, i.e. Rice coding with parameter k=0), thus showing that unary coding is optimal for this geometric distribution. Here is a Huffman tree for an example geometric distribution:

    1: 128-----------------------------\_255_
    2: 64------------------------\_127_/
    3: 32-------------------\_63_/
    4: 16--------------\_31_/
    5: 8----------\_15_/
    6: 4------\_7_/
    7: 2--\_3_/
    8: 1--/

The resulting optimal prefix code is the same as the unary code (except for the last one):

    1: 0
    2: 10
    3: 110
    4: 1110
    5: 11110
    6: 111110
    7: 1111110
    8: 1111111

Do not get confused when you hear the term 'entropy encoding' - it is simply a fancy name for 'information encoding'. And since all information is numbers, or can be represented by numbers, all forms of number encodings are - by definition - forms of 'entropy encoding', which practically just means 'number encoding'. For example, if you simply write down N ones followed by a zero to encode N, that is also an 'entropy encoding' - precisely the so-called unary encoding, which is the same as the output of Huffman encoding applied to a geometric distribution, and therefore the optimal encoding for a geometric probability distribution (regardless of the actual symbols or message). Whenever you encode a 'message', you're actually encoding 'numbers'.
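This claim is easy to check mechanically: running Huffman's merge procedure on the geometric weights 128, 64, ..., 2, 1 yields the unary code lengths 1, 2, ..., 7, 7 (the last two codes share length 7). A small Python sketch, mine, for illustration:

```python
# Compute Huffman code lengths only: each time a symbol's subtree is merged,
# its eventual code grows by one bit.
import heapq

def huffman_lengths(weights):
    """Return the code length per symbol index for the given weight list."""
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    lengths = [0] * len(weights)
    heapq.heapify(heap)
    tick = len(heap)  # unique tie-breaker for equal weights
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # every merge adds one bit to each member's code
        heapq.heappush(heap, (w1 + w2, tick, s1 + s2))
        tick += 1
    return lengths

print(huffman_lengths([128, 64, 32, 16, 8, 4, 2, 1]))  # [1, 2, 3, 4, 5, 6, 7, 7]
```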
In many cases it can be known statistically that the symbols (= numbers) will approximately follow a certain probability distribution, regardless of the message or signal. For example, many lossless wave compression codecs try to 'predict' the next sample from the preceding samples, and just encode the 'error' between the actual sample and the prediction. This way, it can be expected statistically that the 'error' will predominantly be a small number, regardless of what sample you actually encode, so a form of encoding optimized for such probabilities, like Rice coding, will give good results without measuring the actual symbol probabilities at all. The advantage of simply assuming or implying some probability distribution based on statistical data is that you do not need to transmit the symbol/probability pairs (the "codebook") used to construct the Huffman tree; therefore it can result in actual savings compared to entropy encodings that actually measure the probabilities, like Huffman encoding, since transmitting the Huffman tree also creates overhead.

Some naive readers might think that this topic has nothing to do with *music* DSP, but that is a false belief - entropy encoding is *very* relevant to musical DSP, as many audio compression algorithms employ these entropy encodings when working with audio data. So we could say these are "advanced" DSP topics, implemented in certain audio codecs. Using some simple bit tricks and a new entropy encoding technique, I created a practical, lossless waveform compressor with ratios comparable to those of FLAC, and - contrary to most encoders - I'm not using any predictor, which means there is probably room for further improvement. Hopefully I can put this algorithm online soon (and ideally, a paper would also be nice, if I find time for it).
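As an illustration of the kind of coder meant here, a minimal sketch of Rice coding: the quotient goes out in unary, the remainder in k fixed bits, so small prediction residuals get short codes (function names are mine; real codecs also map signed residuals to non-negative integers first):

```python
# Rice code for a non-negative integer n with parameter k (divisor 2**k):
# unary-coded quotient, '0' terminator, then k-bit binary remainder.
def rice_encode(n, k):
    q, r = n >> k, n & ((1 << k) - 1)
    rem = format(r, f"0{k}b") if k else ""
    return "1" * q + "0" + rem

def rice_decode(bits, k):
    q = bits.index("0")                    # count leading 1s (unary quotient)
    r = int(bits[q + 1:q + 1 + k] or "0", 2)
    return (q << k) | r

for n in [0, 1, 5, 17]:
    code = rice_encode(n, k=2)
    print(f"{n:2d} -> {code}")
    assert rice_decode(code, k=2) == n     # round-trips correctly
```

Note that with k=0 this degenerates to plain unary coding, matching the Huffman-on-geometric-weights result above.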
Best regards,
Peter Schoffhauzer

--
dupswapdrop -- the music-dsp mailing list and website:
subscription info, FAQ, source code archive, list archive, book reviews, dsp links
http://music.columbia.edu/cmc/music-dsp
http://music.columbia.edu/mailman/listinfo/music-dsp