Lucene includes Trie types, which essentially store a set of numbers as a
tree of bit sequences. Numbers that share a common "set of high bits" are
stored as one value with a bunch of sub-values, one for each actual number.

The interesting thing is that the data structure is stored this way in memory
and is randomly addressable. You can memory-map a Trie-based SequenceFile
and walk it all sequentially or randomly, unpacking when you please. There
is no unpacking phase required.

Lucene does range queries directly on this data structure, which is a
testament to its random-access efficiency.
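
To make the "common high bits" idea concrete, here is a rough sketch in
Python. It is not Lucene's code or API: the function names, the 4-bit
precision step, and the restriction to non-negative integers are my own
assumptions for illustration. Each number is indexed once per precision
level, and a range is covered by a handful of prefix terms rather than by
every value in it.

```python
def trie_terms(value, precision_step=4, bits=32):
    """Return the (shift, prefix) terms indexed for one non-negative integer.

    Hypothetical helper: one term per precision level, each dropping
    `precision_step` more low bits than the previous one.
    """
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]


def split_range(lo, hi, precision_step=4, bits=32):
    """Cover the inclusive range [lo, hi] with (shift, prefix) terms.

    Unaligned edges are emitted at the current precision; the aligned
    middle is pushed up to the next, coarser level.
    """
    mask = (1 << precision_step) - 1
    terms = []
    shift = 0
    while lo <= hi:
        if shift + precision_step >= bits:
            # Coarsest indexed level: emit whatever is left as-is.
            terms.extend((shift, p) for p in range(lo, hi + 1))
            break
        # Lower edge: prefixes that do not start a block at the next level.
        while lo <= hi and (lo & mask) != 0:
            terms.append((shift, lo))
            lo += 1
        # Upper edge: prefixes that do not end a block at the next level.
        while lo <= hi and (hi & mask) != mask:
            terms.append((shift, hi))
            hi -= 1
        if lo > hi:
            break
        lo >>= precision_step
        hi >>= precision_step
        shift += precision_step
    return terms


if __name__ == "__main__":
    wanted = set(split_range(3, 100, precision_step=4, bits=8))
    # A value is in range if any of its indexed terms is one of the range terms.
    print(any(t in wanted for t in trie_terms(42, 4, 8)))
```

A value matches the range exactly when one of its indexed terms appears in
the range's term set, so the lookup never has to unpack or scan the
individual numbers under a shared prefix.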

On Sat, Sep 3, 2011 at 11:05 AM, Ted Dunning <[email protected]> wrote:

> Compression is likely to help with things like binary matrices or matrices
> of small counts.  Using a binary or trinary random projection will preserve
> this compressibility for one step, but as soon as we are into the first QR
> projection, this property will be lost, I expect.
>
> This is the long way of saying that I agree.
>
> On Sat, Sep 3, 2011 at 2:41 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > Per above.
> >
> > I noticed I do ask for compression of results and intermediate data.
> > (more of a programming reflex really than any motivated decision).
> >
> > But for data such as vectors, assuming sparse vectors are used where
> > appropriate, compression is not going to win much.
> >
> > On the other hand, if native libraries are enabled, the default GZIP codec
> > does not cost much compared to the computations either.
> >
> > And a third option: maybe we shouldn't put any defaults in at all and
> > leave it for -D options, which I see as somewhat of a problem since
> > hadoop somewhat tries to encapsulate those properties in static
> > methods of classes such as FileOutputFormat, which may imply that the
> > property names are not meant to be part of any user contract and just
> > implementation details of a concrete file format.
> >
> > I am leaning towards enforcing no compression by default.
> >
>



-- 
Lance Norskog
[email protected]
