On 4/26/12 4:04 AM, Alvaro Tejero Cantero wrote:
>> Sure, but this is not the spirit of a compressor adapted to the blocking
>> technique (in the sense of [1]).  For a compressor that works with
>> blocks, you need to add some metainformation for each block, and that
>> takes space.  A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.
>>
>> [1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf
> Absolutely!
>
> Blocking seems a good approach for most data, where the large number of
> a priori possible values quickly degrades the potential compression
> gains of a run-length-encoding (RLE) based scheme.
>
> But boolean arrays, which are used extremely often as masks in
> scientific applications and already suffer an 8x storage penalty,
> would be an excellent candidate for RLE.  Boolean arrays are also an
> interesting way to encode attributes as 'bit-vectors': instead of
> storing an enum column 'car color' with values in {red, green, blue},
> you store three boolean arrays 'red', 'green', 'blue'.  Where this
> gets interesting is in the extra generality it allows: you don't need
> a taxonomy, i.e. red and green need not be exclusive if they are tags
> on a genetic sequence (or, in my case, an electrophysiological
> recording).  To compute ANDs and ORs you just perform the
> corresponding bit-wise operations on the reconstructed bit-vectors, or
> you can use some smart algorithm on the intervals themselves (as
> mentioned in another mail, I think R*Trees and Nested Containment
> Lists are two viable candidates).
>
> I don't know whether it's possible to have such a specialization for
> the compression of boolean arrays in PyTables.  Maybe a simple
> alternative route is to make the chunklength dependent on the
> likelihood of repeated data (i.e. on the range of the type domain),
> or, at the very least, to special-case chunklength estimation for
> booleans so that it is somewhat higher than for other datatypes.
> This, again, I think is an exception that would do justice to the
> main use-case of PyTables.
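
To make the bit-vector idea above a bit more concrete, here is a minimal
NumPy sketch (the tag names and data are made up, just to show the
bit-wise AND/OR queries):

    import numpy as np

    # Hypothetical tags over 20 samples of a recording (made-up data).
    n = 20
    red   = np.zeros(n, dtype=np.bool_)
    green = np.zeros(n, dtype=np.bool_)
    blue  = np.zeros(n, dtype=np.bool_)

    # Tags need not be exclusive: samples 3..9 are 'red', 5..14 'green'.
    red[3:10]   = True
    green[5:15] = True

    # ANDs and ORs are plain bit-wise operations on the boolean arrays.
    red_and_green = red & green        # samples tagged with both
    red_or_blue   = red | blue         # samples tagged with either

    # The resulting masks can be used directly to select data.
    data = np.arange(n)
    print(data[red_and_green])         # -> [5 6 7 8 9]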

Yes, I think you raised a good point here.  Well, there are quite a few 
possibilities for reducing the space taken by highly redundant data, and 
the first should be to add a special case in blosc so that, before 
passing control to blosclz, it first checks whether the data is 
identical across the whole block, and if so, collapses everything into a 
counter and a value.  This would require a bit more CPU effort during 
compression (so it could be active only at higher compression levels), 
but would lead to far better compression ratios.
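
To be concrete about that special case, here is a rough Python sketch of
the idea (the real thing would live inside blosc in C; the function
names here are made up for illustration):

    import numpy as np

    def try_collapse_block(block):
        """Hypothetical pre-pass: if every byte in the block is identical,
        return (count, value) instead of handing the block to blosclz."""
        buf = np.frombuffer(block, dtype=np.uint8)
        if buf.size and (buf == buf[0]).all():
            return len(buf), int(buf[0])   # whole block -> a pair of numbers
        return None                        # fall back to the normal codec

    def expand_block(count, value):
        """Inverse operation, used on decompression."""
        return np.full(count, value, dtype=np.uint8).tobytes()

    # A 32 KB block of an all-True boolean mask collapses to two numbers.
    block = np.ones(32 * 1024, dtype=np.bool_).tobytes()
    collapsed = try_collapse_block(block)
    print(collapsed)                       # -> (32768, 1)
    assert expand_block(*collapsed) == block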

Another possibility is to add code that deals directly with compressed 
data, but that should be done more at the PyTables (or carray, the 
package) level, with some help from the blosc compressor.  In 
particular, it would be very interesting to implement interval algebra 
on top of such extremely compressed interval data.
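
As an illustration of what I mean by interval algebra, an AND of two
masks stored as sorted (start, stop) runs can be computed without ever
expanding them back into boolean arrays.  This is just a toy sketch, not
carray/PyTables API:

    def intersect_runs(a, b):
        """AND of two masks given as sorted, non-overlapping
        [start, stop) runs, working directly on the run representation."""
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            start = max(a[i][0], b[j][0])
            stop = min(a[i][1], b[j][1])
            if start < stop:
                out.append((start, stop))
            # advance whichever run ends first
            if a[i][1] < b[j][1]:
                i += 1
            else:
                j += 1
        return out

    # Two toy masks over a long recording, given as run lists.
    red   = [(0, 10), (100, 200)]
    green = [(5, 120)]
    print(intersect_runs(red, green))      # -> [(5, 10), (100, 120)]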

>> Yup, agreed.  Don't know what to do here.  carray was more a
>> proof-of-concept than anything else, but if development for it
>> continues in the future, I should ponder changing the names.
> It's a neat package and I hope it gets the appreciation and support it
> deserves!

Thanks, I also think it can be useful in some situations.  But before it 
gets used more widely, more work should be put into the range of 
operations supported.  Also, defining a C API so that it can be used 
straight from C could help spread adoption too.

-- 
Francesc Alted

