Regarding the algorithm itself, I am now using it at the byte level instead of the line level, so it works on any kind of data.
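To make the pair-merging idea concrete, here is a deliberately naive Java sketch of the core loop: repeatedly find the most frequent adjacent symbol pair and replace it with a fresh symbol. This is an illustration only, not the actual optimized implementation — the class and method names are mine, and this simple version rescans the sequence each pass, so it does not achieve the linear time of the real algorithm.

```java
import java.util.*;

// Toy sketch of pair-merging compression (not the author's optimized code):
// repeatedly replace the most frequent adjacent symbol pair with a new symbol.
public class PairMerge {
    public static List<Integer> compress(byte[] input, Map<Integer, int[]> rules) {
        List<Integer> seq = new ArrayList<>();
        for (byte b : input) seq.add(b & 0xFF); // symbols 0..255 are raw bytes
        int nextSymbol = 256;                   // fresh symbols start at 256
        while (true) {
            // Count all adjacent pairs in the current sequence.
            Map<Long, Integer> counts = new HashMap<>();
            for (int i = 0; i + 1 < seq.size(); i++) {
                long key = ((long) seq.get(i) << 32) | seq.get(i + 1);
                counts.merge(key, 1, Integer::sum);
            }
            // Pick the most frequent pair; stop when no pair repeats.
            long best = -1;
            int bestCount = 1;
            for (Map.Entry<Long, Integer> e : counts.entrySet())
                if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
            if (best < 0) break;
            int a = (int) (best >>> 32), b = (int) (best & 0xFFFFFFFFL);
            rules.put(nextSymbol, new int[]{a, b}); // remember how to expand the new symbol
            // Rewrite the sequence, replacing each occurrence of the pair.
            List<Integer> out = new ArrayList<>();
            for (int i = 0; i < seq.size(); i++) {
                if (i + 1 < seq.size() && seq.get(i) == a && seq.get(i + 1) == b) {
                    out.add(nextSymbol);
                    i++; // skip the second element of the merged pair
                } else {
                    out.add(seq.get(i));
                }
            }
            seq = out;
            nextSymbol++;
        }
        return seq;
    }
}
```

On input like "abababab", each pass roughly halves the sequence ("ab" becomes symbol 256, then "256 256" becomes 257), which is exactly the halving effect that makes the amortized per-byte cost constant in the real implementation.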
The algorithm's runtime is indeed linear in the input length, which still amazes me. How is compression in linear time even possible? Yet it is. The basic trick is that by combining repeated pairs into a new symbol, we essentially halve the remaining input with each pass, so the total work amortizes to a constant number of operations per input byte - while still finding structures of any length.

Just to clarify: O(n) is the lowest conceivable complexity class for any compression algorithm whatsoever. You just can't get any lower than looking at each character once.

Sadly, despite this theoretically excellent linear complexity, the algorithm is not that fast in practical terms. Each byte takes 5 to 7 microseconds to process, so we are looking at a maximum of ~200 kB/sec. (Benchmark: http://code.botcompany.de/1028520) These results are for a single core, as the algorithm is not easy to parallelize, although I suppose it could be done.

Memory-wise, the optimized Java implementation consumes about 100 bytes per input byte, which I guess is pretty decent.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tb2cf064c700f181c-M9ee73b5390fb07a0c885d8b8
Delivery options: https://agi.topicbox.com/groups/agi/subscription
