Compression is likely to help with things like binary matrices or matrices of small counts. Using a binary or ternary random projection would preserve this compressibility for one step, but I expect the property is lost as soon as we hit the first QR step.
This is the long way of saying that I agree.

On Sat, Sep 3, 2011 at 2:41 AM, Dmitriy Lyubimov <[email protected]> wrote:
> Per above.
>
> I noticed I do ask for compression of results and intermediate data
> (more of a programming reflex, really, than any motivated decision).
>
> But for data such as vectors, assuming sparse vectors are used where
> appropriate, compression is not going to win much.
>
> On the other hand, if native libraries are enabled, the default GZIP codec
> does not cost much compared to the computations either.
>
> And a third option: maybe we shouldn't put any defaults in at all and
> leave it to -D options. I see that as somewhat of a problem, since Hadoop
> tries to encapsulate those properties in static methods of classes such
> as FileOutputFormat, which may imply that the property names are not
> meant to be part of any user contract and are just implementation
> details of a concrete file format.
>
> I am leaning towards enforcing no compression by default.
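
To make the trade-off concrete, here is a minimal sketch of the two routes discussed above against the 0.20-era org.apache.hadoop.mapreduce API: hard-wiring compression through the static helpers on the output format, versus setting nothing and leaving the choice to -D options. The class and method names below are made up for illustration; this is not the actual SSVD job setup code.

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

/** Illustrative sketch only -- not the actual Mahout job code. */
public class OutputCompressionSketch {

  /**
   * Option 1: bake compression into the job via the static helpers on the
   * output format. GZIP as the default codec, block compression for
   * sequence files.
   */
  public static void forceGzipBlockCompression(Job job) {
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(
        job, SequenceFile.CompressionType.BLOCK);
  }

  /**
   * Option 2: set nothing. Hadoop's defaults leave the output
   * uncompressed, and a user can still opt in from the command line, e.g.
   *   -Dmapred.output.compress=true
   *   -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
   *   -Dmapred.output.compression.type=BLOCK
   * provided the driver goes through ToolRunner/GenericOptionsParser.
   */
  public static void leaveToDOptions(Job job) {
    // intentionally empty: defaults are whatever the cluster / -D flags say
  }
}

Note that the property names in option 2 (mapred.output.compress and friends) are exactly the kind of detail the static helpers are meant to hide, which is the user-contract concern raised above.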
