Just a small idea. I would make makedict also take a type so that
makedict(1024) also works. Alternatively using typemax(UInt8) would remove
the dependence on the magic number.
On Saturday, 2 January 2016 03:05:19 UTC+9, Erik Schnetter wrote:
>
> Julia distinguishes between two kinds of strings, ASCIIString and
> UTF8String, both subtype of AbstractString. Your code seems to assume
> that a character can handle values between 0 and 255, which is not
> true in Julia. Also, the way the code decomposes the uncompressed
> string rather assumes that each element of the uncompressed string is
> a byte, not a code point, and the resulting strings stored in the
> dictionary are not necessarily value UTF8. Of course, your algorithm
> is completely valid, however, it may be better to use the types UInt8
> and Vector{UInt8} instead of Char and AbstractString. An important
> respective type is `IOBuffer`, and an important function is
> `takebuf_array`.
>
> Smaller points are that Julia prefers the type `Int` over `UInt32` for
> sizes. I would use `Int` instead of both `UInt32` and `Int64`, without
> loss of generality.
>
> This does not explain the large amount of allocations. My assumption
> -- untested, but corroborated by profiling -- is that these come from
> the string concatenation operator `*` in line 4 of `lzw_compress`.
> Instead of repeatedly appending characters to a string, I would use an
> IOBuffer, where appending is much cheaper. You can then use
> `takebuf_array` to obtain the buffer's byte array. The dictionary
> would have type `Dict{Vector{UInt8}, Int}`.
>
> Here is my respective version:
>
> function makedict(size)
> dict = Dict{Vector{UInt8},Int}()
> for i in 0:size-1
> dict[UInt8[i]] = i
> end
> dict
> end
>
> function lzw_compress(uncompressed)
> dict = makedict(256)
> w, widx = IOBuffer(), -1
> result = Int[]
> for b in IOBuffer(uncompressed1).data
> write(w, b)
> newidx = get(dict, w.data, -1)
> if newidx != -1
> widx = newidx
> else
> push!(result, widx)
> dict[takebuf_array(w)] = length(dict)
> write(w, b)
> widx = dict[w.data]
> end
> end
>
> widx!=-1 && push!(result, widx)
> result
> end
>
> In C++, Julia's IOBuffer would be a stringstream, in Java, it would be
> a StringBuffer.
>
> -erik
>
>
> On Fri, Jan 1, 2016 at 9:59 AM, Robert Feldt <[email protected]
> <javascript:>> wrote:
> > I was planning to add a Julia entry for LZW compression at Rosetta Code,
> see
> > http://rosettacode.org/wiki/LZW_compression , and did a first Julia
> > implementation below. It allocates quite some mem though so ideas for
> > speeding it up and/or clarifying the code before I submit it would be
> > appreciated. Thanks!
> >
> > julia> function makedict(size)
> > dict = Dict{AbstractString,UInt32}()
> > [dict[string(Char(i))] = i for i in 0:(size-1)]
> > dict
> > end
> > makedict (generic function with 1 method)
> >
> > julia> function lzw_compress(uncompressed)
> > dict, w, result = makedict(256), "", Int64[]
> > for i in 1:length(uncompressed)
> > wc = w * uncompressed[i:i]
> > if haskey(dict, wc)
> > w = wc
> > else
> > push!(result, dict[w])
> > dict[wc] = length(dict)
> > w = uncompressed[i:i]
> > end
> > end
> >
> > isempty(w) || push!(result, dict[w])
> > result
> > end
> > lzw_compress (generic function with 1 method)
> >
> > julia> t = "TOBEORNOTTOBEORTOBEORNOT"
> > "TOBEORNOTTOBEORTOBEORNOT"
> >
> > julia> compressed = lzw_compress(t)
> > 16-element Array{Int64,1}:
> > 84
> > 79
> > 66
> > 69
> > 79
> > 82
> > 78
> > 79
> > 84
> > 256
> > 258
> > 260
> > 265
> > 259
> > 261
> > 263
> >
> > julia> tl = t ^ 10000;
> >
> > julia> @time lzw_compress(tl);
> > 0.101714 seconds (971.11 k allocations: 54.181 MB, 28.92% gc time)
>
>
>
> --
> Erik Schnetter <[email protected] <javascript:>>
> http://www.perimeterinstitute.ca/personal/eschnetter/
>