Hi, all!

In case anyone is interested, the following is a quick-and-dirty that
will perform the character-level entropy calculation for a specific
data source supplied as an argument (file or string).

The result tells the minimum number of bits per character required to
transmit without loss any message from a source having the same
distribution as the supplied source.

Run against its own source (without the verbose flag) it calculates

    >> char-ent/calc-ent %char-ent.r

    ( 2180 ) characters

    Source entropy is 4.78716832865254 bits per character. 
    Maximum compression gives 40.1603958918433 percent saving. 

If we take the source code for this program as "typical" for REBOL,
then any character-level compression scheme that squishes REBOL
scripts down to less than about 59.84% of their original size can
only do so by losing information.

-jn-

8<------------------------------------------------------------
#!/usr/local/bin/rebol

REBOL [
    title: "calc-ent"
    file:  "calc-ent.r"
    author: "Joel Neely"
    description: {
        Calculates the character entropy of a data source,
        which must be a string or a file.
    }
]

;
; everything is wrapped in an object to protect the
; global namespace
;

char-ent: make object! [

;
; character frequencies are accumulated here
; (8-bit characters are supported)
;
    char-freq: array/initial 256 0

;
; the final result is calculated and retained here
;
    entropy: 0.0

;
; the work is done in this function
;
    calc-ent: function [
        data [string! file!]
        /verbose
    ][
        ichar   ; index for current char (off by 1!)
        iqty    ; frequency for current character
        tqty    ; total character count
                ; (used instead of length? data in case
                ;  we want to add logic in the input loop
                ;  to consider only some characters, e.g. alpha)
    ][
        ;
        ; initialize accumulators
        ;
        entropy: 0.0
        tqty: 0

        ;
        ; grab content if argument is file
        ;
        if file? data [data: read data]

        ;
        ; accumulate character frequencies from data
        ;
        ; wrap loop body in an 'if to consider only
        ; a subset of the characters from the source
        ;
        foreach char data [
            ;
            ; offset char value by one since REBOL can't
            ; understand zero-based indexing ... sigh ;-)
            ;
            ichar: 1 + to-integer char
            poke char-freq ichar (1 + pick char-freq ichar)
            tqty: tqty + 1
        ]

        ;
        ; print detail report (if /verbose refinement is set)
        ; and accumulate final result
        ;
        print ["^/(" tqty ") characters^/"]
        for ichar 1 256 1 [
            ;
            ; ignore unused character values
            ;
            if 0 < iqty: pick char-freq ichar [
                use [p q] [
                    p: iqty / tqty  ; char probability
                    q: p * log-2 p  ; (negative) bits
                    if verbose [
                        print [
                            mold to-char ichar - 1 tab
                            iqty tab
                            p tab q
                        ]
                    ]
                    ;
                    ; accumulate weighted (positive) bits per character
                    ;
                    entropy: entropy - q
                ]
            ]
        ]
        ;
        ; result is character entropy FOR THIS SOURCE ONLY
        ;
        print [
            "Source entropy is" entropy "bits per character." newline
            "Maximum compression gives"
            100.0 * (1.0 - (entropy / 8.0))
            "percent saving." newline
        ]
    ]
]
-- 
To unsubscribe from this list, please send an email to
[EMAIL PROTECTED] with "unsubscribe" in the 
subject, without the quotes.

Reply via email to