Julia distinguishes between two kinds of strings, ASCIIString and
UTF8String, both subtype of AbstractString. Your code seems to assume
that a character can handle values between 0 and 255, which is not
true in Julia. Also, the way the code decomposes the uncompressed
string rather assumes that each element of the uncompressed string is
a byte, not a code point, and the resulting strings stored in the
dictionary are not necessarily value UTF8. Of course, your algorithm
is completely valid, however, it may be better to use the types UInt8
and Vector{UInt8} instead of Char and AbstractString. An important
respective type is `IOBuffer`, and an important function is
`takebuf_array`.

Smaller points are that Julia prefers the type `Int` over `UInt32` for
sizes. I would use `Int` instead of both `UInt32` and `Int64`, without
loss of generality.

This does not explain the large amount of allocations. My assumption
-- untested, but corroborated by profiling -- is that these come from
the string concatenation operator `*` in line 4 of `lzw_compress`.
Instead of repeatedly appending characters to a string, I would use an
IOBuffer, where appending is much cheaper. You can then use
`takebuf_array` to obtain the buffer's byte array. The dictionary
would have type `Dict{Vector{UInt8}, Int}`.

Here is my respective version:

function makedict(size)
    dict = Dict{Vector{UInt8},Int}()
    for i in 0:size-1
        dict[UInt8[i]] = i
    end
    dict
end

function lzw_compress(uncompressed)
    dict = makedict(256)
    w, widx = IOBuffer(), -1
    result = Int[]
    for b in IOBuffer(uncompressed1).data
        write(w, b)
        newidx = get(dict, w.data, -1)
        if newidx != -1
            widx = newidx
        else
            push!(result, widx)
            dict[takebuf_array(w)] = length(dict)
            write(w, b)
            widx = dict[w.data]
        end
    end

    widx!=-1 && push!(result, widx)
    result
end

In C++, Julia's IOBuffer would be a stringstream, in Java, it would be
a StringBuffer.

-erik


On Fri, Jan 1, 2016 at 9:59 AM, Robert Feldt <[email protected]> wrote:
> I was planning to add a Julia entry for LZW compression at Rosetta Code, see
> http://rosettacode.org/wiki/LZW_compression , and did a first Julia
> implementation below. It allocates quite some mem though so ideas for
> speeding it up and/or clarifying the code before I submit it would be
> appreciated. Thanks!
>
> julia> function makedict(size)
>         dict = Dict{AbstractString,UInt32}()
>         [dict[string(Char(i))] = i for i in 0:(size-1)]
>         dict
>        end
> makedict (generic function with 1 method)
>
> julia> function lzw_compress(uncompressed)
>         dict, w, result = makedict(256), "", Int64[]
>         for i in 1:length(uncompressed)
>         wc = w * uncompressed[i:i]
>         if haskey(dict, wc)
>         w = wc
>         else
>         push!(result, dict[w])
>         dict[wc] = length(dict)
>         w = uncompressed[i:i]
>         end
>         end
>
>         isempty(w) || push!(result, dict[w])
>         result
>        end
> lzw_compress (generic function with 1 method)
>
> julia> t = "TOBEORNOTTOBEORTOBEORNOT"
> "TOBEORNOTTOBEORTOBEORNOT"
>
> julia> compressed = lzw_compress(t)
> 16-element Array{Int64,1}:
>   84
>   79
>   66
>   69
>   79
>   82
>   78
>   79
>   84
>  256
>  258
>  260
>  265
>  259
>  261
>  263
>
> julia> tl = t ^ 10000;
>
> julia> @time lzw_compress(tl);
>   0.101714 seconds (971.11 k allocations: 54.181 MB, 28.92% gc time)



-- 
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/

Reply via email to