If we really want to compress sparse vectors, then we need to define a few patterns of sparsity that we can attack well. For one thing, values and indexes should be compressed separately.
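As a minimal, purely illustrative sketch of that split (the class below is invented for this thread, not part of Mahout's Vector hierarchy), the non-zero indexes and values sit in two parallel arrays so each stream can be handed to its own codec:

// Illustrative only: keep the non-zero indexes and the non-zero values of a
// sparse vector in two parallel arrays so that the integer stream and the
// floating-point stream can be compressed by different codecs
// (delta/gamma/PFOR for indexes, a dictionary or float truncation for values).
public final class SplitSparseVector {
  final int[] indexes;   // strictly increasing positions of the non-zeros
  final float[] values;  // values stored as floats where the loss is acceptable

  SplitSparseVector(int[] indexes, float[] values) {
    if (indexes.length != values.length) {
      throw new IllegalArgumentException("index and value streams must align");
    }
    this.indexes = indexes;
    this.values = values;
  }
}

Keeping the two streams apart is what makes the index-only and value-only tricks discussed below possible.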
For matrices with a very small number of distinct non-default values, dictionary-based methods may be of some use. For instance, it is unreasonable for the values in a trinary matrix to take more than one bit per non-zero value in a compressed representation. For other matrices, it might be OK to use a lossy value encoder that quantizes values to get a small-ish dictionary. For matrices with a large number of possible integer values that are highly skewed toward small values (think counts), delta or gamma encoding might be good. In general, there should also be an option to store floats instead of doubles if the loss is acceptable.

Index compression is done fairly well by simply encoding differences using delta, gamma or PFOR. There are a number of special cases where referring to the previous row might help a little bit. Most of our matrices would do best with simple difference encoding. Structured sparse matrices such as diagonal or banded matrices can obviously have even more compact encodings.

Of all of these options, the only thing that I think would help is small-count encoding with index differences. For the SSVD stuff, for instance, the ability to read A faster would make a significant difference in a few of the phases. Even a 10:1 compression rate, however, would probably only make a small difference in total run-time. (Sketches of the index-difference and trinary-value encodings follow the quoted thread below.)

On Sat, Sep 3, 2011 at 7:42 PM, Lance Norskog <[email protected]> wrote:

> Lucene includes Trie types, which essentially store sets of numbers as a
> tree of bit sequences. The set is stored with a common "set of high bits"
> as one value with a bunch of sub-values, one for each actual number.
>
> The interesting thing is that the data structure is stored this way in
> memory and is randomly addressable. You can memory-map a Trie-based
> SequenceFile and walk it all sequentially or randomly, unpacking when you
> please. There is no unpacking phase required.
>
> Lucene does range queries directly on this data structure; this is a
> testament to the random access efficiency.
>
> On Sat, Sep 3, 2011 at 11:05 AM, Ted Dunning <[email protected]> wrote:
>
> > Compression is likely to help with things like binary matrices or
> > matrices of small counts. Using a binary or trinary random projection
> > will preserve this compressibility for one step, but as soon as we are
> > into the first QR projection, this property will be lost, I expect.
> >
> > This is the long way of saying that I agree.
> >
> > On Sat, Sep 3, 2011 at 2:41 AM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> >
> > > Per above.
> > >
> > > I noticed I do ask for compression of results and intermediate data
> > > (more of a programming reflex, really, than any motivated decision).
> > >
> > > But for data such as vectors, assuming sparse vectors are used where
> > > appropriate, compression is not going to win much.
> > >
> > > On the other hand, if native libraries are enabled, the default GZIP
> > > codec does not cost much compared to the computations either.
> > >
> > > And a third option: maybe we shouldn't put any defaults in at all and
> > > leave it to -D options. I see that as somewhat of a problem, since
> > > Hadoop tries to encapsulate those properties in static methods of
> > > classes such as FileOutputFormat, which may imply that the property
> > > names are not meant to be part of any user contract and are just
> > > implementation details of a concrete file format.
> > >
> > > I am leaning towards enforcing no compression by default.
> --
> Lance Norskog
> [email protected]
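For the simple difference encoding of indexes mentioned above, here is a hedged sketch in plain Java. The class and helper are hypothetical illustrations, not Mahout's or Lucene's variable-length integer utilities; the idea is just to delta-code a sorted index stream into base-128 varints so that dense regions cost roughly one byte per entry:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of simple difference encoding for sorted non-zero indexes:
// store each index as the gap to its predecessor, written as a varint.
public final class DeltaIndexCodec {

  // Write each index as a varint-encoded gap from the previous index.
  public static byte[] encode(int[] sortedIndexes) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    int previous = 0;
    for (int index : sortedIndexes) {
      writeVarInt(out, index - previous);
      previous = index;
    }
    out.flush();
    return bytes.toByteArray();
  }

  // Standard base-128 varint: 7 payload bits per byte, high bit means "more".
  private static void writeVarInt(DataOutputStream out, int value) throws IOException {
    while ((value & ~0x7F) != 0) {
      out.writeByte((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.writeByte(value);
  }
}

Gamma or PFOR would replace writeVarInt with a bit-level code, but the delta step stays the same.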
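And for the one-bit-per-non-zero bound on trinary values: in a sparse representation the zeros are already absent, so every stored value is +1 or -1 and only its sign needs a bit. The class below is a hypothetical illustration, not an existing Mahout codec:

import java.util.BitSet;

// Sketch of a one-bit-per-non-zero encoding for trinary values:
// pack the signs of the stored (non-zero) values into a BitSet.
public final class TrinaryValueCodec {

  // Set bit i when the i-th non-zero value is +1; clear means -1.
  public static BitSet encode(double[] nonZeroValues) {
    BitSet signs = new BitSet(nonZeroValues.length);
    for (int i = 0; i < nonZeroValues.length; i++) {
      if (nonZeroValues[i] == 1.0) {
        signs.set(i);
      } else if (nonZeroValues[i] != -1.0) {
        throw new IllegalArgumentException("not a trinary non-zero value: " + nonZeroValues[i]);
      }
    }
    return signs;
  }

  // Recover the values given the bit set and the count of non-zeros.
  public static double[] decode(BitSet signs, int count) {
    double[] values = new double[count];
    for (int i = 0; i < count; i++) {
      values[i] = signs.get(i) ? 1.0 : -1.0;
    }
    return values;
  }
}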
