Re: slow Golomb-Rice-coded sets in Python (with an example spellchecker)

Dave Long Sun, 29 Jul 2012 06:28:38 -0700

There probably aren’t any hash collisions in this set ....

The construction of the golomb-coded set reminds me of a method for uniformly sampling simplices:

Generate n uniform random numbers in the range [0,1], sort them, then take the differences (including the differences to 0 at the bottom end and 1 at the top). This results in (nicely homogeneous) barycentric coordinates; to find the sample point, simply multiply each vertex by its barycentric weight and sum their contributions.
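
A minimal Python sketch of the sampling recipe just described. The function names and the list-of-coordinates vertex representation are my own choices for illustration; note that n random numbers give n+1 weights, so they suit a simplex with n+1 vertices.

```python
import random

def sample_barycentric(n, rng=random):
    """Return n+1 non-negative weights summing to 1.

    Sort n uniform draws from [0,1], then take successive differences,
    including the gaps to 0 at the bottom and to 1 at the top.
    """
    cuts = sorted(rng.random() for _ in range(n))
    edges = [0.0] + cuts + [1.0]
    return [b - a for a, b in zip(edges, edges[1:])]

def sample_point(vertices, rng=random):
    """Pick a point in the simplex spanned by `vertices`.

    `vertices` is a list of equal-length coordinate lists; each vertex is
    multiplied by its barycentric weight and the contributions are summed.
    """
    weights = sample_barycentric(len(vertices) - 1, rng)
    dim = len(vertices[0])
    return [sum(w * v[i] for w, v in zip(weights, vertices))
            for i in range(dim)]
```

For example, sample_point([[0, 0], [1, 0], [0, 1]]) returns a point inside the unit right triangle.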

The fact that the differences in golomb-coding* are geometrically distributed (much more likely to be small than large) has as its continuous corollary that the distance to the boundary of the simplex in high enough dimension is also much more likely to be small than large (in the continuous case, exponentially distributed?); in other words, we recover the folk wisdom that approximation (and hence search) are difficult in high dimensions because very few of the points of a given volume are "close" to its centroid. This also explains why jitter is difficult to avoid: there is only one way to be exactly in tempo, and many more ways to be off.
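
A rough empirical check of the "more likely small than large" claim, as a sketch. It assumes the set members behave like uniform random integers (standing in for hash values); the function names and the choice of universe size are mine. For an exactly exponential distribution the fraction of gaps below the mean would be 1 - 1/e, roughly 0.63, so a result well above one half is what the claim predicts.

```python
import random

def gaps_of_sorted_sample(n, universe, rng):
    """Differences between consecutive values of a sorted uniform sample."""
    values = sorted(rng.randrange(universe) for _ in range(n))
    return [b - a for a, b in zip(values, values[1:])]

def fraction_below_mean(gaps):
    """Share of gaps that are strictly smaller than the average gap."""
    mean = sum(gaps) / len(gaps)
    return sum(g < mean for g in gaps) / len(gaps)

if __name__ == "__main__":
    rng = random.Random(2012)  # fixed seed, only for repeatability
    gaps = gaps_of_sorted_sample(10000, 2**32, rng)
    print(fraction_below_mean(gaps))
```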

(in the discrete case, the birthday paradox is also a reflection of high dimensional shapes being almost all surface and very little content, resp. that events naturally clump)


-Dave

* thinking-out-loud, here is a somewhat self-delimiting octet-friendly variation on rice-coding: code each six bits of the remainder as a bare utf-8 continuation byte, and each non-zero sixteen of the quotient as a normal utf-8 character. If useful, I'd propose calling it "orzo-coding", as it wouldn't be as fine-grained as rice-coding but could be prepared relatively quickly and easily...
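
One possible reading of the footnote, sketched in Python; the details below are guesses at an idea that was only outlined, not a definition of it. Assumptions: the remainder is fixed at six bits and stored in one continuation byte (0x80 | r); a non-zero quotient is stored as a single UTF-8 character placed before that byte; quotients are limited to below 0xD800 so they never land in the surrogate range. The names orzo_encode and orzo_decode are hypothetical.

```python
def orzo_encode(values):
    """Encode non-negative integers, one record per value.

    Record layout (an assumption): optional UTF-8 character holding the
    quotient v >> 6, then one bare continuation byte holding v & 0x3F.
    """
    out = bytearray()
    for v in values:
        q, r = v >> 6, v & 0x3F
        if not 0 <= q < 0xD800:
            raise ValueError("quotient out of range for this sketch")
        if q:  # a zero quotient is simply omitted
            out += chr(q).encode("utf-8")
        out.append(0x80 | r)
    return bytes(out)

def orzo_decode(data):
    """Inverse of orzo_encode for well-formed input."""
    values = []
    i = 0
    while i < len(data):
        b = data[i]
        q = 0
        if b & 0xC0 != 0x80:  # not a continuation byte: quotient character
            n = 1 if b < 0x80 else 2 if b < 0xE0 else 3
            q = ord(data[i:i + n].decode("utf-8"))
            i += n
        values.append((q << 6) | (data[i] & 0x3F))
        i += 1
    return values
```

The stream is self-delimiting in the sense that a record always ends at the first continuation byte that does not belong to the quotient character, although the output as a whole is deliberately not valid UTF-8.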


