On 3/29/12 10:49 AM, Alvaro Tejero Cantero wrote:
>>>    What is your advice on how to monitor the use of
>>> memory? (I need this until PyTables is second skin).
>> top?
> I had so far used it only in a very rudimentary way and found the man
> page quite intimidating. Would you care to share your tips for this
> particular scenario? (e.g. how do you keep the ipython process
> 'focused'?)

Well, top by default keeps the most CPU-consuming processes at the top 
(hence the name), so it is usually easy to spot the interesting one.  
vmstat is another useful utility, but it only reports overall virtual 
memory consumption, not per-process figures.

Finally, if you can afford to instrument your code and you use Linux (I 
assume this is the case), then you may want to use a small routine that 
reports the memory used by the calling process each time it is called.  
Here is an example of how this is done in the PyTables test suite:

https://github.com/PyTables/PyTables/blob/master/tables/tests/common.py#L483

I'm sure you will figure out how to use it in your own scenario.
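
In case it helps, here is a minimal sketch of that idea (untested, Linux 
only; the helper name is made up here, not the one from common.py):

import os

def print_memory_usage(tag=""):
    # Parse /proc/<pid>/status and print the virtual and resident memory
    # of the calling process (Linux only).
    with open("/proc/%d/status" % os.getpid()) as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                print("%s %s" % (tag, line.strip()))

Call it before and after the operation you want to measure, e.g. 
print_memory_usage("after reading table").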

>
>>> It is very rewarding to see that these numexpr's are 3-4 times faster
>>> than the same with arrays in memory. However, I didn't find a way to
>>> set the number of threads used
>> Well, you can use the `MAX_THREADS` variable in 'parameters.py', but
>> this does not offer separate controls for numexpr and blosc.  Feel free to
>> open a ticket asking for improving this functionality.
> Ok, I opened the following tickets (since I have to build the
> application first and then revisit the infrastructural issues, I
> cannot do more about them now):
>
> * one for implementation of references
> https://github.com/PyTables/PyTables/issues/140
> * one for the estimation of dataset (group?) size
> https://github.com/PyTables/PyTables/issues/141
> * one for an interface function to set MAX_THREADS for numexpr
> independently of blosc's
> https://github.com/PyTables/PyTables/issues/142

Excellent. Thanks!
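
In the meantime, the single knob we have can be tweaked like this (an 
untested sketch; "experiment.h5" is just a placeholder, and I am assuming 
the module-level value is picked up when the file is opened):

import tables

# MAX_THREADS lives in tables/parameters.py and currently controls both
# numexpr and Blosc threads (hence the ticket above).
tables.parameters.MAX_THREADS = 6

h5file = tables.openFile("experiment.h5", mode="r")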

>>> When evaluating the blosc benchmarks I found that in my system with
>>> two 6-core processors , using 12 is best for writing and 6 for
>>> reading. Interesting...
>> Yes, it is :)
> Are you interested in my .out bench output file for the
> SyntheticBenchmarks page?

Yes, I am!  And if you can produce the matplotlib figures, that would be 
cause for much rejoicing :)

>>> Another question (maybe for a separate thread): is there any way to
>>> shrink memory usage of booleans to 1 bit? It might well be that this
>>> optimizes the use of the memory bus (at some processing cost). But I
>>> am not aware of a numpy container for this.
>> Maybe a compressed array?  That would lead to using less than 1 bit per
>> element in many situations.  If you are interested in this, look into:
>>
>> https://github.com/FrancescAlted/carray
> Ok, I did some playing around with this:
>
> * a bool array of 10**8 elements with True in two separate slices of
> length 10**6 each compresses by ~350. Using .wheretrue to obtain
> indices is faster by a factor of 2 to 3 than np.nonzero(normal numpy
> array). The resulting filesize is 248kb, still far from storing the 4
> or 6 integer indexes that define the slices (I am experimenting with
> an approach for scientific databases where this is a concern).

Oh, you were asking for an 8-to-1 compressor (booleans as bits), but 
apparently 350-to-1 is not enough? :)
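
For the archives, a sketch of that kind of experiment (the positions of the 
True slices are made up; your numbers will vary):

import numpy as np
import carray as ca

# 10**8 booleans with two True slices of 10**6 elements each
a = np.zeros(10**8, dtype=np.bool_)
a[10**6:2 * 10**6] = True
a[7 * 10**7:7 * 10**7 + 10**6] = True

ca_bool = ca.carray(a)
print("compression ratio: %.1f" % (float(ca_bool.nbytes) / ca_bool.cbytes))

# wheretrue() yields the indices of the True elements, much like
# np.nonzero(a)[0], but working on the compressed container
idx = [i for i in ca_bool.wheretrue()]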

>
> * a sample of my normal electrophysiological data (15M Int16 data
> points) compresses by about 1.7-1.8.

Well, I was expecting a bit more for this kind of time series data, but it 
is not that bad for int16.  Int32 or float64 data would probably reach 
better compression ratios.
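
If you want to check that quickly, something like this would compare ratios 
across dtypes (the synthetic signal is just a stand-in for your real data):

import numpy as np
import carray as ca

signal = (1000 * np.sin(np.linspace(0, 500, 15 * 10**6))).astype(np.int16)

for dtype in (np.int16, np.int32, np.float64):
    c = ca.carray(signal.astype(dtype))
    print("%-8s ratio: %.2f" % (np.dtype(dtype).name,
                                float(c.nbytes) / c.cbytes))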

>
> * how blosc chooses the chunklen is black magic for me, but it seems to
> be quite spot-on. (e.g. it changed from '1' for a 64x15M array to
> 64*1024 when CArraying only one row).

Uh?  You mean 1 byte as the blocksize?  That is certainly a bug.  Could 
you detail a bit more how you arrived at this result?  Providing an example 
would be very useful.

>
> * A quick way to know how well your data will compress in PyTables if
> you will be using blosc is to test in the REPL with CArray. I guess
> for the other compressors we still need to go (for the moment) to
> checking filesystem-reported sizes.

Just be sure to experiment with different chunk lengths too, by using the 
`chunklen` parameter in the carray constructor.
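
For example (the chunk lengths below are just examples, not recommendations):

import numpy as np
import carray as ca

data = (np.arange(15 * 10**6) % 1000).astype(np.int16)

for cl in (16 * 1024, 64 * 1024, 256 * 1024):
    c = ca.carray(data, chunklen=cl)
    print("chunklen=%-7d ratio: %.2f" % (cl, float(c.nbytes) / c.cbytes))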

-- 
Francesc Alted

