Hi all,

I am trying to find the best way to make histograms from large data sets. Up to now, I have simply been loading entire columns into in-memory numpy arrays and building histograms from those. However, I am currently working on a handful of datasets where this is prohibitively memory intensive (causing an out-of-memory kernel panic on a shared machine that has to be rebooted via a support ticket makes you a little gun-shy), so I am now exploring other options.
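For concreteness, the current approach looks roughly like the sketch below. The file, table, and column names are made up for illustration (not my real data), and the snippet writes a small demo file first so it is self-contained:

```python
import numpy as np
import tables

# Write a small demo file so the snippet is self-contained
# (file/table/column names here are illustrative only).
rows = np.zeros(100_000, dtype=[("energy", "f8")])
rows["energy"] = np.random.normal(0.0, 1.0, rows.size)
with tables.open_file("demo.h5", mode="w") as f:
    f.create_table("/", "mytable", obj=rows)

# The approach described above: pull the whole column into RAM,
# then histogram the in-memory ndarray.
with tables.open_file("demo.h5", mode="r") as f:
    energy = f.root.mytable.col("energy")  # full column as a numpy array
    counts, edges = np.histogram(energy, bins=50, range=(-10.0, 10.0))
```

This is fast, but the `col()` call materializes the entire column in memory, which is exactly what becomes a problem on the larger datasets.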
I know that the Column object is set up to act, in some circumstances, like a numpy ndarray. So my first thought was to create the histogram from the Column object directly. This is, however, about 1000x slower than loading the column into memory and creating the histogram from the in-memory array. Please see my test notebook at:

http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html

For such a small table, loading into memory is not an issue. For larger tables, though, it is a problem, and I had hoped that PyTables was optimized so that histogramming directly from disk would be no slower than loading into memory and histogramming. Is there some other way of accessing the column (or Array or CArray) data that would make the histograms faster?

Regards,
Jon

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
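P.S. In case it helps frame the question, the disk-friendly behaviour I am hoping for is something like the following hand-rolled chunked accumulation. This is only a sketch: the chunk size, names, and demo file are made up, and this may well not be the fastest thing PyTables can do. The idea is to fix the bin edges up front so the per-chunk counts can simply be summed:

```python
import numpy as np
import tables

def chunked_hist(table, field, bins, hist_range, chunk_rows=1_000_000):
    """Histogram one column of a PyTables Table without loading it whole.

    Bin edges are fixed in advance so each chunk's counts can be added
    into a running total; peak memory is one chunk, not the full column.
    """
    edges = np.histogram_bin_edges([], bins=bins, range=hist_range)
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    for start in range(0, table.nrows, chunk_rows):
        stop = min(start + chunk_rows, table.nrows)
        block = table.read(start, stop, field=field)  # one slice in RAM
        counts += np.histogram(block, bins=edges)[0]
    return counts, edges

# Self-contained demo with an illustrative table.
rows = np.zeros(100_000, dtype=[("energy", "f8")])
rows["energy"] = np.random.normal(0.0, 1.0, rows.size)
with tables.open_file("demo_chunked.h5", mode="w") as f:
    f.create_table("/", "mytable", obj=rows)

with tables.open_file("demo_chunked.h5", mode="r") as f:
    counts, edges = chunked_hist(
        f.root.mytable, "energy", bins=50, hist_range=(-10.0, 10.0),
        chunk_rows=10_000,  # small chunks just to exercise the loop
    )
```

The question is essentially whether something like this (or better) already exists inside PyTables, or whether there is a smarter access pattern than slicing by hand.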