Hi Ben,

2011/1/4, Ben Elliston <b...@air.net.au>:
> On Mon, Jan 03, 2011 at 05:48:50PM +0100, Francesc Alted wrote:
>
>> Array objects are non-chunked.  In order to use compression, you need
>> to use a CArray:
>> http://www.pytables.org/docs/manual/ch04.html#CArrayClassDescr
>
> Here's my chance to ask a question that I've had for a week or two:
> how are compressed arrays actually implemented?  I gather that the
> array contents are compressed (using the chosen compressor) into the
> HDF5 data file, but are CArrays actually compressed in memory?

Well, yes and no ;-)  In principle they are compressed only on disk.
However, if a CArray is small enough and you access it often enough,
chances are that its compressed chunks will stay in the OS filesystem
cache, that is, in memory and in compressed form.  This is only a sort
of fake in-memory compression, though.  For a truly compressed
in-memory array, see this other project of mine:

https://github.com/FrancescAlted/carray
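
To make the idea of a "true" compressed in-memory array a bit more
concrete, here is a minimal sketch that uses only the Python standard
library.  It is not how carray is implemented; the CompressedArray
class, the CHUNK size and the use of zlib are all assumptions made for
illustration.  The values are stored as independently compressed
chunks, and only the chunk holding the requested item is decompressed:

```python
import zlib
from array import array

CHUNK = 4096  # items per chunk (arbitrary choice for this sketch)


class CompressedArray:
    """Holds float64 values as zlib-compressed chunks kept in memory."""

    def __init__(self, values):
        self.length = len(values)
        # Compress each run of CHUNK items separately.
        self.chunks = [
            zlib.compress(array("d", values[i:i + CHUNK]).tobytes())
            for i in range(0, self.length, CHUNK)
        ]

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if index < 0:
            index += self.length
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Decompress only the chunk that holds the requested item.
        chunk = array("d")
        chunk.frombytes(zlib.decompress(self.chunks[index // CHUNK]))
        return chunk[index % CHUNK]

    def compressed_size(self):
        """Total number of bytes held by the compressed chunks."""
        return sum(len(c) for c in self.chunks)


# Hypothetical data with many repeated values.
values = [float(i // 1000) for i in range(100_000)]
packed = CompressedArray(values)
raw_size = 8 * len(values)  # bytes needed for uncompressed float64 data
```

Comparing packed.compressed_size() with raw_size gives a rough feel
for how much memory such a layout could save on repetitive data.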

> My data set is larger than physical memory, but has a lot of repeated
> values that lead to ~90% compression.  Thus, it should be possible to
> keep the whole array compressed in memory and decompress chunks of the
> array as necessary.  Is this what PyTables does?

You can try doing this with PyTables, yes.  As I said, PyTables keeps
data compressed on disk in chunks.  Whenever you read one or more
chunks of the disk-based array, they are decompressed automatically and
you receive the data in decompressed form (the opposite happens when
writing).  Furthermore, you can perform operations on compressed
on-disk arrays by using the tables.Expr class:

http://www.pytables.org/moin/ComputingKernel
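
As a rough sketch of the above, the following creates a compressed
CArray, reads a slice of it, and evaluates an expression into a second
compressed CArray with tables.Expr.  It assumes NumPy and a PyTables
release with the snake_case API (open_file, create_carray); older
releases spell these openFile and createCArray.  The file name, array
size and compression settings are made up for illustration:

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location and compression settings.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
filters = tables.Filters(complevel=5, complib="zlib")

with tables.open_file(path, mode="w") as h5:
    # A chunked, compressed on-disk array with many repeated values.
    a = h5.create_carray(h5.root, "a", atom=tables.Float64Atom(),
                         shape=(1_000_000,), filters=filters)
    a[:] = np.repeat(np.arange(100, dtype="float64"), 10_000)

    # Reading a slice decompresses only the chunks it touches.
    window = a[20_000:20_010]

    # Evaluate 2*a + 1 block by block into another compressed CArray,
    # so the full uncompressed result never has to sit in memory.
    out = h5.create_carray(h5.root, "out", atom=tables.Float64Atom(),
                           shape=a.shape, filters=filters)
    expr = tables.Expr("2*a + 1", uservars={"a": a})
    expr.set_output(out)
    expr.eval()
    first, last = out[0], out[-1]
```

Whether this is fast enough for a data set larger than physical memory
depends on the chunk shape and the access pattern, so it is worth
measuring on real data.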

You may also want to try the carray package mentioned above, but it
is still in early beta (for example, multidimensional arrays are not
supported yet, only bidimensional tables).

HTH,

-- 
Francesc Alted

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users