On Thu, Apr 26, 2012 at 04:07, Francesc Alted <fal...@pytables.org> wrote:
> On 4/25/12 7:05 AM, Alvaro Tejero Cantero wrote:
>> Hi, a minor update on this thread
>>
>>>> * a bool array of 10**8 elements with True in two separate slices of
>>>> length 10**6 each compresses by ~350. Using .wheretrue to obtain
>>>> indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy
>>>> array). The resulting filesize is 248kb, still far from storing the 4
>>>> or 6 integer indexes that define the slices (I am experimenting with
>>>> an approach for scientific databases where this is a concern).
>>> Oh, you were asking for an 8-to-1 compressor (booleans as bits), but
>>> apparently 350-to-1 is not enough? :)
>> Here I expected more from a run-length-like compression scheme. My
>> array would be compressible to the following representation:
>>
>> (0, x) : 0
>> (x, x+10**6) : 1
>> (x+10**6, y) : 0
>> (y, y+10**6) : 1
>> (y+10**6, 10**8) : 0
>>
>> or just:
>> (x, x+10**6) : 1
>> (y, y+10**6) : 1
>>
>> where x and y are two reasonable integers (i.e. in range and with no 
>> overlap).
>
> Sure, but this is not the spirit of a compressor adapted to the blocking
> technique (in the sense of [1]).  For a compressor that works with
> blocks, you need to add some metainformation for each block, and that
> takes space.  A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.
>
> [1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

Absolutely!

Blocking seems a good approach for most data, where the large number of
possible values quickly erodes the potential compression gains of a
run-length-encoding (RLE) based scheme.

But boolean arrays, which are used extremely often as masks in
scientific applications and already suffer an 8x storage penalty (one
byte per bit), would be an excellent candidate for RLE. Boolean arrays
are also an interesting way to encode attributes as 'bit-vectors':
instead of storing an enum column 'car color' with values in {red,
green, blue}, you store three boolean arrays 'red', 'green' and
'blue'. Where this gets interesting is in the added generality, because
you don't need a taxonomy: red and green need not be exclusive if they
are tags on a genetic sequence (or, in my case, an electrophysiological
recording). To compute ANDs and ORs you just perform the corresponding
bit-wise operations on the reconstructed bit-vectors, or you can use
some smart algorithm on the intervals themselves (as mentioned in
another mail, I think R*-trees and Nested Containment Lists are two
viable candidates).

I don't know whether such a specialization for compressing boolean
arrays is possible in PyTables. Maybe a simpler alternative route is to
make the chunk length depend on the likelihood of repeated data (i.e.
on the size of the type's domain), or at the very least to special-case
the chunk-length estimation for booleans to be somewhat higher than for
other datatypes. Again, I think this is an exception that would do
justice to a main use case of PyTables.
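In the meantime one can of course override the guess by hand. A
minimal sketch with the PyTables 2.x API (the file name, array name and
the chosen chunkshape are just for illustration):

    import numpy as np
    import tables

    f = tables.openFile('masks.h5', 'w')
    filters = tables.Filters(complevel=5, complib='blosc')
    # Force a larger-than-default chunk for a boolean mask, where long
    # runs of identical values are likely.
    mask = f.createCArray(f.root, 'mask', tables.BoolAtom(),
                          shape=(10**8,), filters=filters,
                          chunkshape=(2**20,))
    mask[10**6:2*10**6] = np.ones(10**6, dtype=bool)
    f.close()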

>>>> * how blosc chooses the chunklen is black magic for me, but it seems to
>>>> be quite spot-on. (e.g. it changed from '1' for a 64x15M array to
>>>> 64*1024 when CArraying only one row).
>>> Uh?  You mean 1 byte as a blocksize?  This is certainly a bug.  Could
>>> you detail a bit more how you achieve this result?  Providing an example
>>> would be very useful.
>> I revisited this issue. While in PyTables CArray the guesses are
>> reasonable, the problem is in carray.carray (or in its reporting of
>> chunklen).
>>
>> This is the offender:
>> carray((64, 15600000), int16)  nbytes: 1.86 GB; cbytes: 1.04 GB; ratio: 1.78
>>    cparams := cparams(clevel=5, shuffle=True)
>>
>> In [87]: x.chunklen
>> Out[87]: 1
>>
>> Could it be that carray is not reporting the second dimension of the
>> chunkshape? (in PyTables, this is 262144)
>
> Ah yes, this is it.  The carray package is not as sophisticated as HDF5,
> and it only blocks in the leading dimension.  In this case, it is saying
> that the block is a complete row.  So this is the intended behaviour.

Ok, it makes sense, and in my particular use case, the rows do fit in
memory, so there is no need for further chunking.
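For anyone else hitting this, a small sketch of what I was seeing (if I
read the carray API correctly; shapes shrunk here so it runs quickly):

    import numpy as np
    import carray as ca

    # carray blocks only along the leading dimension, so for a 2-D
    # array chunklen counts whole rows: chunklen == 1 means one
    # (possibly very large) row per chunk.
    x = ca.carray(np.zeros((64, 10**6), dtype='int16'),
                  cparams=ca.cparams(clevel=5, shuffle=True))
    print(x.chunklen)   # -> 1, i.e. one complete row per chunk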

>>
>> The fact that both PyTables' CArray and carray.carray are named carray
>> is a bit confusing.
>
> Yup, agreed.  Don't know what to do here.  carray was more a
> proof-of-concept than anything else, but if development for it continues
> in the future, I should ponder about changing the names.

It's a neat package and I hope it gets the appreciation and support it deserves!

Cheers,

Álvaro.

> --
> Francesc Alted
