On Thu, Apr 26, 2012 at 04:07, Francesc Alted <fal...@pytables.org> wrote:
> On 4/25/12 7:05 AM, Alvaro Tejero Cantero wrote:
>> Hi, a minor update on this thread
>>
>>>> * a bool array of 10**8 elements with True in two separate slices of
>>>> length 10**6 each compresses by ~350. Using .wheretrue to obtain
>>>> indices is faster by a factor of 2 to 3 than np.nonzero (on a normal
>>>> numpy array). The resulting file size is 248 KB, still far from
>>>> storing the 4 or 6 integer indexes that define the slices (I am
>>>> experimenting with an approach for scientific databases where this
>>>> is a concern).
>>> Oh, you were asking for an 8 to 1 compressor (booleans as bits), but
>>> apparently a 350 to 1 is not enough? :)
>> Here I expected more from a run-length-like compression scheme. My
>> array would be compressible to the following representation:
>>
>> (0, x)           : 0
>> (x, x+10**6)     : 1
>> (x+10**6, y)     : 0
>> (y, y+10**6)     : 1
>> (y+10**6, 10**8) : 0
>>
>> or just:
>>
>> (x, x+10**6)     : 1
>> (y, y+10**6)     : 1
>>
>> where x and y are two reasonable integers (i.e. in range and with no
>> overlap).
>
> Sure, but this is not the spirit of a compressor adapted to the blocking
> technique (in the sense of [1]). For a compressor that works with
> blocks, you need to add some metainformation for each block, and that
> takes space. A ratio of 350 to 1 is pretty good for, say, 32 KB blocks.
>
> [1] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf
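By the way, to make the interval representation above concrete:
recovering the runs of True from such a mask is cheap with plain numpy.
A minimal sketch (runs_of_true is just an ad-hoc helper here, and x, y
are placeholder offsets):

import numpy as np

def runs_of_true(mask):
    # (start, stop) pairs for each run of True in a 1-d bool array
    d = np.diff(mask.astype(np.int8))
    starts = np.flatnonzero(d == 1) + 1
    stops = np.flatnonzero(d == -1) + 1
    if mask[0]:
        starts = np.r_[0, starts]
    if mask[-1]:
        stops = np.r_[stops, mask.size]
    return list(zip(starts, stops))

mask = np.zeros(10**8, dtype=bool)
x, y = 10**7, 5 * 10**7    # placeholder positions
mask[x:x + 10**6] = True
mask[y:y + 10**6] = True
print(runs_of_true(mask))
# [(10000000, 11000000), (50000000, 51000000)]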
Absolutely! Blocking seems a good approach for most data, where the large
number of possible values quickly erodes the potential gains of a
run-length-encoding (RLE) based scheme. But boolean arrays, which are
used extremely often as masks in scientific applications and already
suffer an 8x storage penalty, would be an excellent candidate for RLE.

Boolean arrays are also an interesting way to encode attributes as
'bit-vectors', i.e. instead of storing an enum column 'car color' with
values in {red, green, blue}, you store three boolean arrays 'red',
'green', 'blue'. Where this gets interesting is in allowing more
generality, because you don't need a taxonomy: red and green need not be
exclusive if they are tags on a genetic sequence (or, in my case, an
electrophysiological recording). To compute ANDs and ORs you just perform
the corresponding bit-wise operations once you reconstruct the
bit-vectors, or you can use some smart algorithm on the intervals
themselves (as mentioned in another mail, I think R*-trees and Nested
Containment Lists are two viable candidates). A toy sketch is in the
P.S. below.

I don't know whether such a specialization for compressing boolean arrays
is possible in PyTables. Maybe a simpler, alternative route is to make
the chunklen depend on the likelihood of repeated data (i.e. on the size
of the type's domain), or at the very least to special-case the chunklen
estimation for booleans so that it is somewhat higher than for other
datatypes. This, again, is an exception that I think would do justice to
a main use case of PyTables.

>>>> * how blosc chooses the chunklen is black magic for me, but it seems
>>>> to be quite spot-on (e.g. it changed from '1' for a 64x15M array to
>>>> 64*1024 when CArraying only one row).
>>> Uh? You mean 1 byte as a blocksize? This is certainly a bug. Could
>>> you detail a bit more how you achieve this result? Providing an example
>>> would be very useful.
>> I revisited this issue. While in PyTables CArray the guesses are
>> reasonable, the problem is in carray.carray (or in its reporting of
>> chunklen).
>>
>> This is the offender:
>> carray((64, 15600000), int16) nbytes: 1.86 GB; cbytes: 1.04 GB; ratio: 1.78
>>   cparams := cparams(clevel=5, shuffle=True)
>>
>> In [87]: x.chunklen
>> Out[87]: 1
>>
>> Could it be that carray is not reporting the second dimension of the
>> chunkshape? (in PyTables, this is 262144)
>
> Ah yes, this is it. The carray package is not as sophisticated as HDF5,
> and it only blocks in the leading dimension. In this case, it is saying
> that the block is a complete row. So this is the intended behaviour.

OK, that makes sense, and in my particular use case the rows do fit in
memory, so there is no need for further chunking (a small demonstration
is in the P.P.S. below).

>> The fact that both PyTables' CArray and carray.carray are named carray
>> is a bit confusing.
>
> Yup, agreed. Don't know what to do here. carray was more a
> proof-of-concept than anything else, but if development for it continues
> in the future, I should ponder about changing the names.

It's a neat package and I hope it gets the appreciation and support it
deserves!

Cheers,

Álvaro.

> --
> Francesc Alted
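P.S. In case it is useful for the archives, a toy numpy sketch of the
bit-vector idea (the 'red'/'green' tags are just the example from above,
deliberately overlapping, since no taxonomy is assumed):

import numpy as np

n = 10**6
red = np.zeros(n, dtype=bool)
green = np.zeros(n, dtype=bool)
red[1000:2000] = True      # tag two overlapping stretches
green[1500:2500] = True

both = red & green         # AND: samples tagged red *and* green
either = red | green       # OR: samples tagged red *or* green
print(both.sum())          # 500
print(either.sum())        # 1500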
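P.P.S. The leading-dimension blocking of carray is easy to see on a
scaled-down version of my array (I shrank the rows so this is cheap to
run; the exact chunklen values will of course depend on carray's version
and heuristics):

import numpy as np
import carray as ca

a = np.zeros((64, 150000), dtype=np.int16)
c = ca.carray(a)
print(c.chunklen)    # 1: each block is a complete row, since only
                     # the leading dimension is chunked

b = ca.carray(a[0])  # a single row: now blosc picks a real block
print(b.chunklen)    # some larger value chosen by the heuristics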