On 4/25/12 7:05 AM, Alvaro Tejero Cantero wrote:
> Hi, a minor update on this thread
>
>>> * a bool array of 10**8 elements with True in two separate slices of
>>> length 10**6 each compresses by ~350. Using .wheretrue to obtain
>>> indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy
>>> array). The resulting filesize is 248kb, still far from storing the 4
>>> or 6 integer indexes that define the slices (I am experimenting with
>>> an approach for scientific databases where this is a concern).
>> Oh, you were asking for an 8 to 1 compressor (booleans as bits), but
>> apparently a 350 to 1 is not enough? :)
> Here I expected more from a run-length-like compression scheme. My
> array would be compressible to the following representation:
>
> (0, x) : 0
> (x, x+10**6) : 1
> (x+10**6, y) : 0
> (y, y+10**6) : 1
> (y+10**6, 10**8) : 0
>
> or just:
> (x, x+10**6) : 1
> (y, y+10**6) : 1
>
> where x and y are two reasonable integers (i.e. in range and with no overlap).

Sure, but this is not the spirit of a compressor adapted to the blocking 
technique (in the sense of [1]).  For a compressor that works with 
blocks, you need to add some metainformation for each block, and that 
takes space.  A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.

[1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf
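
To make the comparison concrete, here is a minimal sketch (plain NumPy, not
part of PyTables or Blosc) of the run-length view described above: it reduces
a boolean mask to [start, stop) index pairs, then estimates how many
compressed bytes each block gets at a 350 to 1 ratio. The helper name
true_runs, the scaled-down demo array, and the 32 KB block size are
assumptions chosen for illustration; the actual block size Blosc picks may
differ.

```python
import math
import numpy as np

def true_runs(mask):
    """Return an (n, 2) array of [start, stop) index pairs, one per run of True."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with False on both sides so every run has a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    # Positions where the value flips; they alternate start, stop, start, stop...
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return edges.reshape(-1, 2)

# Scaled-down demo (10**6 elements instead of 10**8) with two runs of True.
n, run_len = 10**6, 10**4
x, y = 1000, 500000          # hypothetical, non-overlapping start offsets
mask = np.zeros(n, dtype=bool)
mask[x:x + run_len] = True
mask[y:y + run_len] = True
runs = true_runs(mask)       # four integers describe the whole mask

# Back-of-the-envelope block overhead for the full-size case.
nbytes = 10**8               # one byte per bool
block_size = 32 * 1024       # assumed block size ("say, 32 KB")
n_blocks = math.ceil(nbytes / block_size)
bytes_per_block = (nbytes / 350) / n_blocks  # compressed bytes available per block
```

With these assumptions there are a few thousand blocks and each one ends up
with on the order of a hundred bytes, metainformation included, which is why a
block-based compressor cannot get down to the handful of integers that an
explicit list of slices needs.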

>
>>> * how blosc chooses the chunklen is black magic to me, but it seems to
>>> be quite spot-on. (e.g. it changed from '1' for a 64x15M array to
>>> 64*1024 when CArraying only one row).
>> Uh?  You mean 1 byte as a blocksize?  This is certainly a bug.  Could
>> you detail a bit more how you achieve this result?  Providing an example
>> would be very useful.
> I revisited this issue. While in PyTables CArray the guesses are
> reasonable, the problem is in carray.carray (or in its reporting of
> chunklen).
>
> This is the offender:
> carray((64, 15600000), int16)  nbytes: 1.86 GB; cbytes: 1.04 GB; ratio: 1.78
>    cparams := cparams(clevel=5, shuffle=True)
>
> In [87]: x.chunklen
> Out[87]: 1
>
> Could it be that carray is not reporting the second dimension of the
> chunkshape? (in PyTables, this is 262144)

Ah yes, this is it.  The carray package is not as sophisticated as HDF5, 
and it only blocks in the leading dimension.  In this case, it is saying 
that the block is a complete row.  So this is the intended behaviour.
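
As a rough check of what chunklen == 1 means here, the following sketch
computes the uncompressed size of one chunk when chunking happens only along
the leading dimension. It uses NumPy only; chunk_nbytes is a hypothetical
helper written for this message, not a function of the carray package.

```python
import numpy as np

def chunk_nbytes(shape, dtype, chunklen):
    """Uncompressed bytes in one chunk when only the leading dimension is chunked."""
    # Every chunk spans all trailing dimensions in full.
    row_items = int(np.prod(shape[1:], dtype=np.int64))
    return chunklen * row_items * np.dtype(dtype).itemsize

# The array from the report: 64 rows of 15600000 int16 values, chunklen of 1.
shape = (64, 15600000)
one_chunk = chunk_nbytes(shape, np.int16, chunklen=1)
total = shape[0] * one_chunk
```

So a chunklen of 1 stands for one full row of 15600000 int16 values, about
31 MB uncompressed, and 64 such chunks add up to the 1.86 GB that carray
reports as nbytes.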

>
> The fact that both PyTables' CArray and carray.carray are named carray
> is a bit confusing.

Yup, agreed.  I don't know what to do here.  carray was more of a
proof of concept than anything else, but if its development continues
in the future, I should consider changing the names.

-- 
Francesc Alted


_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
