On Fri, Nov 16, 2012 at 7:33 PM, Jon Wilson <j...@fnal.gov> wrote:

> Hi Anthony,
> I don't think that either of these helps me here (unless I've misunderstood
> something). I need to fill the histogram with every row in the table, so
> querying doesn't gain me anything (especially since the query just returns
> an iterator over rows). I also don't need (at the moment) to compute any
> function of the column data, just count (weighted) entries into various
> bins. I suppose I could write one Expr for each bin of my histogram, but
> that seems dreadfully inefficient and probably difficult to maintain.
>

Hi Jon,

Barring changes to numexpr itself, this is exactly what I am suggesting.
Well, either write one query expr per bin or (more cleverly) write one
expr which, when evaluated for a row, returns the integer bin number (1,
2, 3, ...) that the row falls in.  Then you can simply count() for each
bin number.

For example, if you wanted to histogram data which runs over [0, 100] into
10 bins, then the expr "r/10", cast to an integer dtype, would do the
trick.  This has the advantage of only running over the data once.  (Also,
I am not convinced that running over the data multiple times is less
efficient than doing row-based iteration.  You would have to test it on
your data to find out.)
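
For concreteness, here is a minimal sketch of the bin-number approach
(the file, table, and column names are hypothetical, and it assumes the
PyTables 2.x API):

    import numpy as np
    import tables

    h5 = tables.openFile("data.h5", mode="r")    # hypothetical file
    col = h5.root.mytable.cols.r                 # values assumed in [0, 100]

    # One expr evaluated for every row; PyTables/numexpr walks the column
    # in chunks internally instead of iterating row by row.
    e = tables.Expr("r / 10.0", uservars={"r": col})
    binno = e.eval().astype("int64")             # integer bin number per row
    binno[binno == 10] = 9                       # fold r == 100 into last bin
    counts = np.bincount(binno, minlength=10)    # weights=w gives weighted counts

    h5.close()

Note that eval() still materializes the result array in memory; if even
that is too large, Expr.setOutput() can send the result to a disk-based
EArray instead, which you can then count over in slices.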


> It is a reduction operation, and would greatly benefit from chunking, I
> expect. Not unlike sum(), which is implemented as a specially supported
> reduction operation inside numexpr (buggily, last I checked). I suspect
> that a substantial improvement in histogramming requires direct support
> from either pytables or from numexpr. I don't suppose that there might be a
> chunked-reduction interface exposed somewhere that I could hook into?
>

This is definitely a feature to request from numexpr.
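
In the meantime, the chunked reduction is easy enough to do by hand:
read the column in slices and accumulate partial histograms.  A rough
sketch, again with hypothetical names:

    import numpy as np
    import tables

    h5 = tables.openFile("data.h5", mode="r")
    tbl = h5.root.mytable
    edges = np.linspace(0.0, 100.0, 11)    # 10 bins over [0, 100]
    hist = np.zeros(10, dtype=np.float64)  # float so weighted fills also work

    chunk = 1000000                        # rows per slice; tune to your memory
    for start in range(0, tbl.nrows, chunk):
        stop = min(start + chunk, tbl.nrows)
        r = tbl.read(start=start, stop=stop, field="r")
        h, _ = np.histogram(r, bins=edges) # pass weights= here if needed
        hist += h

    h5.close()

This only ever holds one slice in memory, and each read is a contiguous
chunk rather than a row-by-row iteration.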

Be Well
Anthony


>  Jon
>
> Anthony Scopatz <scop...@gmail.com> wrote:
>>
>>  On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson <j...@fnal.gov> wrote:
>>
>>> Hi all,
>>> I am trying to find the best way to make histograms from large data
>>> sets.  Up to now, I've been just loading entire columns into in-memory
>>> numpy arrays and making histograms from those.  However, I'm currently
>>> working on a handful of datasets where this is prohibitively memory
>>> intensive (causing an out-of-memory kernel panic on a shared machine
>>> that you have to open a ticket to have rebooted makes you a little
>>> gun-shy), so I am now exploring other options.
>>>
>>> I know that the Column object is rather nicely set up to act, in some
>>> circumstances, like a numpy ndarray.  So my first thought is to try just
>>> creating the histogram out of the Column object directly. This is,
>>> however, 1000x slower than loading it into memory and creating the
>>> histogram from the in-memory array.  Please see my test notebook at:
>>> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>>>
>>> For such a small table, loading into memory is not an issue.  For larger
>>> tables, though, it is a problem, and I had hoped that pytables was
>>> optimized so that histogramming directly from disk would proceed no
>>> slower than loading into memory and histogramming. Is there some other
>>> way of accessing the column (or Array or CArray) data that will make
>>> faster histograms?
>>>
>>
>> Hi Jon,
>>
>> This is not surprising, since the column object itself is going to be
>> iterated over row by row.  As you found, reading in each row individually
>> is prohibitively expensive compared to reading in all the data at once.
>>
>> To do this the right way for data that is larger than system memory, you
>> need to read it in chunks.  Luckily, PyTables already has tools to help
>> you automate this process.  I would recommend that you use expressions [1]
>> or queries [2] to do your histogramming more efficiently.
>>
>> Be Well
>> Anthony
>>
>> 1. http://pytables.github.com/usersguide/libref/expr_class.html
>> 2.
>> http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying
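>>
>> For instance, one bin of the histogram could be counted with a query
>> along these lines (a sketch; "tbl" is the Table and "r" the column,
>> both hypothetical):
>>
>>     # rows with 0 <= r < 10, evaluated chunk-wise on disk
>>     nbin0 = len(tbl.getWhereList("(r >= 0) & (r < 10)"))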
>>
>>
>>
>>> Regards,
>>> Jon
>>>
>>
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
>