A new paper on a similar topic: http://lemire.me/blog/archives/2012/09/12/fast-integer-compression-decoding-billions-of-integers-per-second/
http://arxiv.org/pdf/1209.2137v1.pdf

"we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression rate within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding."

I haven't read the details, but at a quick skim I think they reach the same conclusion: PackedBinary (FOR) is significantly faster than PFOR, and gives up only a little compression. They then go on to show that this difference is small next to the gain from using vectorized SIMD instructions. (Sketches of both techniques follow below.)

I'm leery of their reasoning in places, though: "According to our analysis, one decoding step of this algorithm would require at least 11 assembly instructions. In contrast, one step of our interleaved bit unpacking from § 3 (used by our SIMD-BP128) would typically require only 5 and 10 assembly instructions in a typical and worst case, respectively." In a paper about the benefits of SIMD, an argument presuming that fewer instructions means faster is radically out of place. Even without SIMD, instruction costs vary enormously: an integer divide can take an order of magnitude more cycles than an add, and a superscalar core can issue several independent adds per cycle, so a raw instruction count says little about throughput. I'm tempted to think that one of the authors "gets it" and the other doesn't.
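For anyone who hasn't looked at FOR lately, here's roughly what it boils down to. This is a minimal scalar sketch of my own, not code from the paper: the names (bits_needed, for_pack, for_unpack) are made up, and it assumes 32-bit values with the block minimum already subtracted by the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Bits needed to represent v (0 needs 0 bits here).  A FOR block
 * uses b = bits_needed(max delta in the block) for every value. */
static uint32_t bits_needed(uint32_t v) {
    uint32_t b = 0;
    while (v) { b++; v >>= 1; }
    return b;
}

/* Pack n values at b bits each into out; returns 32-bit words written. */
static size_t for_pack(const uint32_t *in, size_t n, uint32_t b,
                       uint32_t *out) {
    uint64_t acc = 0;   /* bit accumulator */
    uint32_t used = 0;  /* bits currently held in acc */
    size_t w = 0;
    for (size_t i = 0; i < n; i++) {
        acc |= (uint64_t)in[i] << used;
        used += b;
        while (used >= 32) {   /* flush full words */
            out[w++] = (uint32_t)acc;
            acc >>= 32;
            used -= 32;
        }
    }
    if (used) out[w++] = (uint32_t)acc;
    return w;
}

/* Unpack n values of b bits each, adding the block base back. */
static void for_unpack(const uint32_t *in, size_t n, uint32_t b,
                       uint32_t base, uint32_t *out) {
    uint64_t acc = 0;
    uint32_t used = 0;
    size_t r = 0;
    uint32_t mask = (b == 32) ? 0xFFFFFFFFu : ((1u << b) - 1);
    for (size_t i = 0; i < n; i++) {
        if (used < b) {   /* refill the accumulator */
            acc |= (uint64_t)in[r++] << used;
            used += 32;
        }
        out[i] = ((uint32_t)acc & mask) + base;
        acc >>= b;
        used -= b;
    }
}
```

The contrast with PFOR is the absence of exception patching: one fixed width per block keeps the inner decode loop branch-free, which is where FOR's speed advantage comes from.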
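And this is the flavor of vectorized unpacking the paper advocates. Again my own sketch, not the authors' code: it's hard-wired to a 4-bit width for brevity (SIMD-BP128 generates one routine per bit width), and it assumes the paper's vertical layout, where 32-bit lane j of each packed word carries values j, j+4, j+8, and so on; simd_unpack4 is a made-up name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Unpack 128 4-bit values from 4 packed SSE registers.  Because the
 * layout is vertical, one shift and one mask decode four integers at
 * once, one per 32-bit lane. */
static void simd_unpack4(const __m128i *in, uint32_t *out) {
    const __m128i mask = _mm_set1_epi32((1 << 4) - 1);
    __m128i *o = (__m128i *)out;
    for (int w = 0; w < 4; w++) {          /* 4 packed 128-bit words */
        __m128i v = _mm_loadu_si128(&in[w]);
        for (int s = 0; s < 8; s++) {      /* 8 fields per 32-bit lane */
            _mm_storeu_si128(o++, _mm_and_si128(v, mask));
            v = _mm_srli_epi32(v, 4);      /* expose the next field */
        }
    }
}
```

Unrolled, each group of four integers costs roughly a shift, an AND, and a store, which is presumably where their instruction counts come from. But as above, what actually matters is how those instructions schedule on the core's issue ports, not how many there are.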
--nate