----- Forwarded message from Kragen Javier Sitaker <kra...@canonical.org> -----
Date: Tue, 18 Oct 2011 02:21:20 -0400 From: Kragen Javier Sitaker <kra...@canonical.org> To: Darius Bacon <dar...@wry.me> Subject: Re: radix-sorting rational numbers with an efficient serialization of continued fractions Message-ID: <20111018062120.ga9...@canonical.org> References: <20111017225537.ga1...@canonical.org> <201110180002.p9i02y5h003...@wry.me> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <201110180002.p9i02y5h003...@wry.me> User-Agent: Mutt/1.5.20 (2009-06-14) Status: RO Content-Length: 2559 Lines: 48 On Mon, Oct 17, 2011 at 05:02:34PM -0700, Darius Bacon wrote: > Only skimmed this yet, but I'll note that your E(n) appeared in > _Managing Gigabytes_ as the 'gamma code' and is said to perform > reasonably for encoding postings lists. Thank you very much! I had no idea. It looks like Elias's original gamma code uses 00001 instead of 11110 to represent "five-bit number", which means his code isn't lexicographically sortable, but the Managing Gigabytes version of it (on p.117, table 3.5, "Example codes for integers") is indeed identical to my E(n). The LingPipe implementation of Elias gamma coding is documented to use Elias's version. src.it.unimi.dsi.mg4j.io.ByteArrayPostingList.writeGamma calls writeUnary, which uses Elias's non-sortable version (`while( i-- != 0 ) write( 0 );`) instead of the MG book's sortable version. <http://www.java2s.com/Open-Source/Java-Document/Search-Engine/mg4j/it/unimi/dsi/mg4j/io/ByteArrayPostingList.java.htm> > Delta code is a bit fancier: the number of bits used for the 'actual number' > is encoded with the gamma code. Right. Which one is better depends on your input distribution and your utility function. They're actually encoding posting-list *gaps*, i.e. the differences between adjacent document IDs in a posting list. I suspect that Golomb coding, with the divisor M set to (posting list length / max document id), would provide better compression than Elias's gamma or delta codes. WP says that Golomb coding is the optimal prefix code for alphabets following a geometric distribution, such as the intervals between successes in a Bernoulli process. If your document IDs are randomly assigned, then the posting list for a particular term is guaranteed to be a Bernoulli process. The only question is whether some nonrandom distribution might be able to provide better compression using gamma or delta coding. They mention Golomb coding on pp.121-2. They say it "performs only marginally better than global γ or δ codes," then go on to talk about hybrid coding. I think the search engine I wrote in 2006 <http://lists.canonical.org/pipermail/kragen-hacks/2006-August/000432.html> might benefit from using Golomb coding or something similar for its pseudo-posting-lists. It currently uses the "Altavista trick" to encode integers in variable numbers of bytes, 7 bits per byte, with one bit per byte used as a termination marker, but I suspect that it can probably use substantially less space with Golomb coding. Do you mind if I forward your mail and this one to kragen-discuss? Kragen ----- End forwarded message ----- -- To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-discuss