On Tue, Dec 15, 2009 at 7:46 AM, Sam Van Oort <[email protected]> wrote:
> On Dec 10, 6:26 pm, Tatu Saloranta <[email protected]> wrote:
...
>> Also: using chunk identifiers like command-line tools has one nice
>> benefit; that is, you can just sequence blocks without restrictions as
>> there is no initial header (which can be downside in some cases too).
> Could you explain a little more? I haven't looked at the C LZF code,
> to avoid possible legal problems. I can do that now that my Java LZF
> extensions appear complete.

It just means that there is no separate header for the whole file, just a
sequence of blocks/chunks. Each chunk starts with the 2-byte marker 'ZV'
(not sure why those were chosen) and a single-byte type (0/1 for
non-compressed/compressed), followed by a 2-byte length. Compressed blocks
carry a second 2-byte length, so that both the compressed and the original
length are recorded. H2 uses a 4-byte total length indicator, and then a
sequence of chunks (I think). That works; you just cannot simply
concatenate the resulting files together.

>> What kind of improvements are there? Better hashing? I assume changes
>> to format wouldn't be needed?
> No changes to format... just a system that stores more hashes and
> checks for the best of several candidate back-references. Also a
> variant that hashes *all* bytes rather than just the literals and last
> couple from each back-reference, but that version is still too slow to
> be useful.

Right. I am interested in performance trade-offs. Given that the current
version already compresses VERY fast (testing I did suggested 5x faster
than gzip... I need to double-check, since it is almost as fast as
decompression), there is some room for more processing if it can
significantly help in finding optimal sequences. Alternatively, maybe there
are things that can be done on the decompression side as well; your numbers
suggest this is indeed the case. But even as things are now, LZF decompress
is 25% to 100% faster than Inflate (gzip) in my tests.

> As it stands now, it looks like these won't make it into the H2
> codebase, because Mr. Mueller wants to keep the code pared down, but
> here's a teaser in case anyone is interested in pre-release versions.

I think that is sensible -- as Thomas said, the piece that H2 itself needs
should be optimized for its needs, and so an extended reusable version can
be separate. Results definitely look encouraging, thank you for sharing
them.
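
To make the chunk layout from earlier in this message a bit more concrete, here is a rough Java sketch of a headerless chunk sequence. It is only an illustration under stated assumptions, not the actual LZF or H2 code: the class and method names are hypothetical, the 2-byte lengths are assumed to be big-endian, and for compressed chunks the compressed length is assumed to come before the original length. It writes only non-compressed chunks (no compressor is included) and shows that two independently written streams can be concatenated and still be parsed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a headerless 'ZV' chunk sequence; not the real LZF code. */
class LzfChunkSketch {
    static final int TYPE_RAW = 0;        // payload stored as-is
    static final int TYPE_COMPRESSED = 1; // payload is compressed
    static final int MAX_CHUNK = 0xFFFF;  // largest value a 2-byte length can hold

    /** Appends one non-compressed chunk: 'Z' 'V' 0, 2-byte length (big-endian assumed), payload. */
    static void writeRawChunk(ByteArrayOutputStream out, byte[] data) {
        if (data.length > MAX_CHUNK) {
            throw new IllegalArgumentException("chunk too large for a 2-byte length");
        }
        out.write('Z');
        out.write('V');
        out.write(TYPE_RAW);
        out.write(data.length >>> 8);
        out.write(data.length & 0xFF);
        out.write(data, 0, data.length);
    }

    /** Walks a chunk sequence and returns the original (uncompressed) length of each chunk. */
    static List<Integer> originalLengths(byte[] in) {
        List<Integer> lengths = new ArrayList<>();
        int pos = 0;
        while (pos < in.length) {
            if (in.length - pos < 5 || in[pos] != 'Z' || in[pos + 1] != 'V') {
                throw new IllegalArgumentException("bad chunk header at offset " + pos);
            }
            int type = in[pos + 2];
            int storedLen = readUint16(in, pos + 3); // number of payload bytes that follow
            pos += 5;
            if (type == TYPE_RAW) {
                lengths.add(storedLen);
            } else if (type == TYPE_COMPRESSED) {
                // Assumption: compressed length first, then a second 2-byte original length.
                if (in.length - pos < 2) {
                    throw new IllegalArgumentException("truncated compressed chunk header");
                }
                lengths.add(readUint16(in, pos));
                pos += 2;
            } else {
                throw new IllegalArgumentException("unknown chunk type " + type);
            }
            if (storedLen > in.length - pos) {
                throw new IllegalArgumentException("truncated chunk payload");
            }
            pos += storedLen; // skip the payload; a real reader would copy or decompress it
        }
        return lengths;
    }

    static int readUint16(byte[] b, int off) {
        return ((b[off] & 0xFF) << 8) | (b[off + 1] & 0xFF);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream first = new ByteArrayOutputStream();
        writeRawChunk(first, "hello".getBytes(StandardCharsets.US_ASCII));
        ByteArrayOutputStream second = new ByteArrayOutputStream();
        writeRawChunk(second, "chunks!".getBytes(StandardCharsets.US_ASCII));

        // No file-level header, so plain concatenation yields a valid stream.
        first.write(second.toByteArray(), 0, second.size());
        System.out.println(originalLengths(first.toByteArray())); // prints [5, 7]
    }
}
```

A format with a leading total-length field would not survive the concatenation step in `main`, since the first stream's total would no longer match the combined data.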

For testing I have used Japex with success, so if you want to get pretty
graphs for testing you may want to have a look (or I can help -- I have
plans to do something in this area).

-+ Tatu +-

--
You received this message because you are subscribed to the Google Groups "H2 Database" group.
