Lucene includes Trie types, which essentially store sets of numbers as a tree of bit sequences: a common set of high bits is stored once as one value, with a sub-value under it for each actual number that shares that prefix.
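As a rough, self-contained sketch of that grouping (plain Java rather than Lucene's actual trie code; the class name and the 8-bit prefix width are chosen only for illustration):

    import java.util.SortedMap;
    import java.util.SortedSet;
    import java.util.TreeMap;
    import java.util.TreeSet;

    // Group a set of longs by their shared high bits: drop the low
    // PREFIX_SHIFT bits to form the common prefix, store that prefix once,
    // and hang the full values off it as sub-values.
    public class TriePrefixSketch {
        static final int PREFIX_SHIFT = 8;  // illustrative precision step

        static SortedMap<Long, SortedSet<Long>> group(long[] values) {
            SortedMap<Long, SortedSet<Long>> trie = new TreeMap<Long, SortedSet<Long>>();
            for (long v : values) {
                long prefix = v >>> PREFIX_SHIFT;   // common "set of high bits"
                SortedSet<Long> bucket = trie.get(prefix);
                if (bucket == null) {
                    bucket = new TreeSet<Long>();
                    trie.put(prefix, bucket);
                }
                bucket.add(v);
            }
            return trie;
        }
    }

A range query over a structure like this can take whole prefix buckets for the interior of the range and only touch individual values at the two edges, which is where the range-query efficiency mentioned below comes from.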
The interesting thing is that the data structure is stored this way in memory and is randomly addressable. You can memory-map a Trie-based SequenceFile and walk it sequentially or randomly, unpacking as you please; there is no separate unpacking phase. Lucene does range queries directly on this data structure, which is a testament to its random-access efficiency.

On Sat, Sep 3, 2011 at 11:05 AM, Ted Dunning <[email protected]> wrote:
> Compression is likely to help with things like binary matrices or matrices
> of small counts. Using a binary or trinary random projection will preserve
> this compressibility for one step, but as soon as we are into the first QR
> projection, this property will be lost, I expect.
>
> This is the long way of saying that I agree.
>
> On Sat, Sep 3, 2011 at 2:41 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > Per above.
> >
> > I noticed I do ask for compression of results and intermediate data
> > (more of a programming reflex, really, than any motivated decision).
> >
> > But for data such as vectors, assuming sparse vectors are used where
> > appropriate, compression is not going to win much.
> >
> > On the other hand, if native libraries are enabled, the default GZIP
> > codec does not cost much compared to the computations either.
> >
> > And a third option: maybe we shouldn't put any defaults in at all and
> > leave it to -D options. I see that as somewhat of a problem, since
> > Hadoop tries to encapsulate those properties in static methods of
> > classes such as FileOutputFormat, which may imply that the property
> > names are not meant to be part of any user contract and are just
> > implementation details of a concrete file format.
> >
> > I am leaning towards enforcing no compression by default.

--
Lance Norskog
[email protected]
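For context on the FileOutputFormat point above, here is a rough sketch of the two routes being contrasted, written against Hadoop's 0.20-era mapreduce API; the class and property names below are assumptions about that API generation, not a statement of what the defaults should be:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class CompressionConfigSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "compression-config-sketch");
            job.setOutputFormatClass(SequenceFileOutputFormat.class);

            // Route 1: bake a compression default into the driver via the
            // static helpers on the output format classes.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);
        }
    }

Route 2 is to set nothing in the driver and leave it to the user on the command line, e.g. -Dmapred.output.compress=true -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec -Dmapred.output.compression.type=BLOCK, which only takes effect when the driver runs through ToolRunner so that GenericOptionsParser picks up the -D flags.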
