Hi Stephen,
This sounds fantastic, and exactly what I'm looking for. I'll take a closer
look tomorrow.
Jon
Stephen Simmons <m...@stevesimmons.com> wrote:
>Back in 2006/07 I wrote an optimized histogram function for pytables +
>numpy. The main steps were:
>
>- Read in chunksize sections of the pytables array, so the HDF5
>  library just needs to decompress full blocks of data from disk into
>  memory; this eliminates subsequent copying/merging of partial data
>  blocks.
>- Modify numpy's bincount function to be more suitable for high-speed
>  histograms by avoiding data type conversions, eliminating the
>  initial pass to determine bounds, etc.
>- Modify numpy's histogram function to update existing histogram
>  counts, so that huge pytables datasets can be histogrammed by
>  reading in successive chunks.
>- Write numpy functions in C to do weighted averages and simple joins.
>
>The net result of optimizing both the pytables data storage and the
>numpy histogramming was probably a 50x increase in speed. I was
>certainly getting >1M rows/sec for weighted-average histograms, using
>a 2005 Dell laptop. I had plans to submit it as a patch to numpy, but
>work priorities at the time took me in another direction. One email
>about it, with some C code, is here:
>http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html
>
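>In rough outline, the chunked approach looks something like the
>sketch below (plain numpy + pytables rather than the optimized C
>version; the file name, table path, column name and bin edges are
>just placeholders):
>
>import numpy as np
>import tables
>
>def chunked_hist(table, field, edges, chunksize=1000000):
>    # Accumulate histogram counts over one column of a pytables
>    # Table without ever holding the full column in memory.
>    counts = np.zeros(len(edges) - 1, dtype=np.int64)
>    # Fixed, uniform bounds let us bin with arithmetic instead of
>    # a search, in the spirit of the modified bincount above.
>    lo, hi = edges[0], edges[-1]
>    scale = (len(edges) - 1) / (hi - lo)
>    for start in range(0, table.nrows, chunksize):
>        # One contiguous read per chunk, so HDF5 decompresses
>        # whole blocks rather than partial ones.
>        stop = min(start + chunksize, table.nrows)
>        chunk = table.read(start, stop, field=field)
>        idx = ((chunk - lo) * scale).astype(np.intp)
>        # This sketch clamps out-of-range values into the edge bins.
>        np.clip(idx, 0, len(counts) - 1, out=idx)
>        # += updates the existing counts, so successive chunks
>        # simply accumulate into one histogram.
>        counts += np.bincount(idx, minlength=len(counts))
>    return counts
>
>with tables.open_file('data.h5') as f:  # placeholder file name
>    h = chunked_hist(f.root.mytable, 'x',  # placeholder table/column
>                     edges=np.linspace(0.0, 1.0, 101))
>
>A weighted-average histogram drops out of the same loop by also
>accumulating np.bincount(idx, weights=w, minlength=len(counts)) for a
>weights column w, then dividing the two accumulators at the end.
>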
>I can send a proper Python source package for it if anyone is
>interested.
>
>Regards,
>Stephen
>
>------------------------------
>Date: Sat, 17 Nov 2012 23:54:39 +0100
>From: Francesc Alted <fal...@gmail.com>
>Subject: Re: [Pytables-users] Histogramming 1000x too slow
>
>On 11/16/12 6:02 PM, Jon Wilson wrote:
>
>> Hi all,
>> I am trying to find the best way to make histograms from large data
>> sets. Up to now, I've been just loading entire columns into
>> in-memory numpy arrays and making histograms from those. However,
>> I'm currently working on a handful of datasets where this is
>> prohibitively memory intensive (causing an out-of-memory kernel
>> panic on a shared machine that you have to open a ticket to have
>> rebooted makes you a little gun-shy), so I am now exploring other
>> options.
>>
>> I know that the Column object is rather nicely set up to act, in
>> some circumstances, like a numpy ndarray. So my first thought is to
>> try creating the histogram from the Column object directly. This
>> is, however, 1000x slower than loading it into memory and creating
>> the histogram from the in-memory array. Please see my test notebook
>> at:
>> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>>
>> For such a small table, loading into memory is not an issue. For
>> larger tables, though, it is a problem, and I had hoped that
>> pytables was optimized so that histogramming directly from disk
>> would proceed no slower than loading into memory and histogramming.
>> Is there some other way of accessing the column (or Array or
>> CArray) data that will make faster histograms?
>
>Indeed, a 1000x slowdown is quite a lot, but it is important to
>stress that you are doing a disk operation every time you access a
>data element, and that takes time. Perhaps using Array or CArray
>would improve times a bit, but frankly, I don't think this is going
>to buy you much speed.
>
>The problem here is that you have too many layers, and this makes
>access slower. You may have better luck with carray
>(https://github.com/FrancescAlted/carray), which supports this sort
>of operation but uses a much simpler persistence machinery. At any
>rate, the results are far better than PyTables:
>
>In [6]: import numpy as np
>
>In [7]: import carray as ca
>
>In [8]: N = 1e7
>
>In [9]: a = np.random.rand(N)
>
>In [10]: %time h = np.histogram(a)
>CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
>Wall time: 0.55 s
>
>In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')
>
>In [12]: %time h = np.histogram(ad)
>CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
>Wall time: 5.81 s
>
>So, the overhead for using a disk-based array is just 10x (not 1000x as
>in PyTables). I don't know if a 10x slowdown is acceptable to you, but
>in case you need more speed, you could probably implement the histogram
>as a method of the carray class in Cython:
>
>https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651
>
>It should not be too difficult to come up with an optimal
>implementation using a chunk-based approach.
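>
>Even a pure-Python version of that chunk-based approach captures the
>idea (a minimal sketch; chunklen is the carray's native chunk length,
>and the fallback default here is just an assumption):
>
>import numpy as np
>import carray as ca
>
>def carray_histogram(arr, bins):
>    # Histogram a (possibly disk-based) carray chunk by chunk, so
>    # only one chunk's worth of data is decompressed at a time.
>    chunksize = getattr(arr, 'chunklen', 1 << 20)
>    counts = np.zeros(len(bins) - 1, dtype=np.int64)
>    for start in range(0, len(arr), chunksize):
>        chunk = arr[start:start + chunksize]  # decompress one slice
>        c, _ = np.histogram(chunk, bins=bins)
>        counts += c  # accumulate into the running histogram
>    return counts
>
>a = ca.carray(np.random.rand(int(1e7)))  # in-memory carray
>h = carray_histogram(a, bins=np.linspace(0.0, 1.0, 11))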
>
>--
>Francesc Alted
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.