On 4/26/12 4:04 AM, Alvaro Tejero Cantero wrote:
>> Sure, but this is not the spirit of a compressor adapted to the
>> blocking technique (in the sense of [1]). For a compressor that works
>> with blocks, you need to add some metainformation for each block, and
>> that takes space. A ratio of 350 to 1 is pretty good for, say, 32 KB
>> blocks.
>>
>> [1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf
>
> Absolutely!
>
> Blocking seems a good approach for most data, where the many possible
> a priori values quickly degrade the potential compression gains of a
> run-length-encoding (RLE) based scheme.
>
> But boolean arrays, which are used extremely often as masks in
> scientific applications and already suffer an 8x storage penalty,
> would be an excellent candidate for RLE. Boolean arrays are also an
> interesting way to encode attributes as 'bit-vectors': instead of
> storing an enum column 'car color' with values in {red, green, blue},
> you store three boolean arrays 'red', 'green' and 'blue'. Where this
> gets interesting is in allowing more generality, because you don't
> need a taxonomy, i.e. red and green need not be exclusive if they are
> tags on a genetic sequence (or, in my case, an electrophysiological
> recording). To compute ANDs and ORs you just perform the corresponding
> bit-wise operations if you reconstruct the bit-vectors, or you can use
> some smart algorithm on the intervals themselves (as mentioned in
> another mail, I think; R*-trees or Nested Containment Lists are two
> viable candidates).
>
> I don't know whether it's possible to have such a specialization for
> compression of boolean arrays in PyTables. Maybe a simpler alternative
> route is to make the chunklength dependent on the likelihood of
> repeated data (i.e. the range of the type domain), or at the very
> least to special-case chunklength estimation for booleans so that it
> is somewhat higher than for other datatypes. This, again, I think is
> an exception that would do justice to the main use-case of PyTables.
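For concreteness, here is what the bit-vector encoding and the RLE idea
described above look like in plain NumPy (the tag arrays and the
rle_encode helper are invented for illustration; nothing here is
PyTables or carray API):

import numpy as np

# Made-up tag masks over 10 samples of a recording; each tag gets its
# own boolean array instead of one enum column, so tags need not be
# mutually exclusive.
red   = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 1], dtype=bool)
green = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)

# ANDs and ORs on tags are plain element-wise bit operations on the masks.
both   = red & green          # samples tagged both red and green
either = red | green          # samples tagged red or green

def rle_encode(mask):
    """Collapse a 1-d boolean mask into (value, run_length) pairs."""
    runs = []
    start = 0
    for i in range(1, len(mask) + 1):
        if i == len(mask) or mask[i] != mask[start]:
            runs.append((bool(mask[start]), i - start))
            start = i
    return runs

print(rle_encode(red))        # [(True, 3), (False, 4), (True, 3)]

Masks with long runs (the usual case for selection masks) shrink to a
handful of such pairs, which is where the RLE argument comes from.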
Yes, I think you raised a good point here. Well, there are quite a few
possibilities for reducing the space taken by highly redundant data. The
first should be to add a special case in blosc so that, before passing
control to blosclz, it checks whether all the data in the block is
identical and, if so, collapses the whole block into a counter and a
value (a toy sketch of this idea appears at the end of this message).
This would require a bit more CPU effort during compression (so it could
be activated only at higher compression levels), but it would lead to
far better compression ratios. Another possibility is to add code that
operates directly on the compressed data, but that should be done more
at the PyTables (or carray, the package) level, with some help from the
blosc compressor. In particular, it would be very interesting to
implement interval algebra on top of such extremely compressed interval
data.

>> Yup, agreed. Don't know what to do here. carray was more a
>> proof-of-concept than anything else, but if development for it
>> continues in the future, I should ponder about changing the names.
>
> It's a neat package and I hope it gets the appreciation and support it
> deserves!

Thanks, I also think it can be useful in some situations. But before it
sees wider use, more work needs to go into the range of operations it
supports. Also, defining a C API so that it can be used straight from C
could help spread adoption of the package too.

--
Francesc Alted
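A toy sketch of the constant-block special case mentioned above, in
plain Python/NumPy (the compress_block/decompress_block names are
invented for illustration; this is not blosc's internals, only the
(count, value) collapse the message describes):

import numpy as np

def compress_block(block):
    """If every element in the block is identical, collapse it to a
    (count, value) pair instead of handing it to the regular codec."""
    if block.size > 0 and np.all(block == block[0]):
        return ('constant', block.size, block.dtype, block[0])
    # Otherwise fall through to the normal compression path (blosclz, ...).
    return ('raw', block)

def decompress_block(token):
    if token[0] == 'constant':
        _, count, dtype, value = token
        return np.full(count, value, dtype=dtype)
    return token[1]

# A 32 KB block of an all-True boolean mask collapses to almost nothing.
mask = np.ones(32 * 1024, dtype=bool)
token = compress_block(mask)
assert np.array_equal(decompress_block(token), mask)

The same (value, run length) representation would also be a natural
starting point for the interval algebra mentioned above, since ANDs and
ORs can be computed on the run boundaries without ever expanding the
masks.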