[kragen@canonical.org: Re: radix-sorting rational numbers with an efficient serialization of continued fractions]

Kragen Javier Sitaker Tue, 18 Oct 2011 00:04:04 -0700

----- Forwarded message from Kragen Javier Sitaker <kra...@canonical.org> -----

Date: Tue, 18 Oct 2011 02:21:20 -0400
From: Kragen Javier Sitaker <kra...@canonical.org>
To: Darius Bacon <dar...@wry.me>
Subject: Re: radix-sorting rational numbers with an efficient serialization
        of continued fractions
Message-ID: <20111018062120.ga9...@canonical.org>
References: <20111017225537.ga1...@canonical.org>
        <201110180002.p9i02y5h003...@wry.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <201110180002.p9i02y5h003...@wry.me>
User-Agent: Mutt/1.5.20 (2009-06-14)
Status: RO
Content-Length: 2559
Lines: 48

On Mon, Oct 17, 2011 at 05:02:34PM -0700, Darius Bacon wrote:
> Only skimmed this yet, but I'll note that your E(n) appeared in
> _Managing Gigabytes_ as the 'gamma code' and is said to perform
> reasonably for encoding postings lists.

Thank you very much! I had no idea.

It looks like Elias's original gamma code uses 00001 instead of 11110 to
represent "five-bit number", which means his code isn't lexicographically
sortable, but the Managing Gigabytes version of it (on p.117, table 3.5,
"Example codes for integers") is indeed identical to my E(n).  The LingPipe
implementation of Elias gamma coding is documented to use Elias's version.
src.it.unimi.dsi.mg4j.io.ByteArrayPostingList.writeGamma calls writeUnary,
which uses Elias's non-sortable version (`while( i-- != 0 ) write( 0 );`)
instead of the MG book's sortable version.
<http://www.java2s.com/Open-Source/Java-Document/Search-Engine/mg4j/it/unimi/dsi/mg4j/io/ByteArrayPostingList.java.htm>

> Delta code is a bit fancier: the number of bits used for the 'actual number'
> is encoded with the gamma code.

Right.  Which one is better depends on your input distribution and your utility
function.

They're actually encoding posting-list *gaps*, i.e. the differences between
adjacent document IDs in a posting list.  I suspect that Golomb coding, with
the divisor M set to (posting list length / max document id), would provide
better compression than Elias's gamma or delta codes.  WP says that Golomb
coding is the optimal prefix code for alphabets following a geometric
distribution, such as the intervals between successes in a Bernoulli process.
If your document IDs are randomly assigned, then the posting list for a
particular term is guaranteed to be a Bernoulli process.  The only question is
whether some nonrandom distribution might be able to provide better compression
using gamma or delta coding.

They mention Golomb coding on pp.121-2.  They say it "performs only marginally
better than global γ or δ codes," then go on to talk about hybrid coding.

I think the search engine I wrote in 2006
<http://lists.canonical.org/pipermail/kragen-hacks/2006-August/000432.html>
might benefit from using Golomb coding or something similar for its
pseudo-posting-lists.  It currently uses the "Altavista trick" to encode
integers in variable numbers of bytes, 7 bits per byte, with one bit per byte
used as a termination marker, but I suspect that it can probably use
substantially less space with Golomb coding.

Do you mind if I forward your mail and this one to kragen-discuss?

Kragen

----- End forwarded message -----
-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-discuss

[kragen@canonical.org: Re: radix-sorting rational numbers with an efficient serialization of continued fractions]

Reply via email to