Python: sorting 10,000 records of 10,000 floats each, then finding the max,
min, and mean of the entire 100,000,000-element 32-bit float array (400 MB),
on a 6-year-old white iMac:
*11.6 seconds.
*This doesn't include the time to generate the 400 MB of random (normally distributed) data.
Try it on your own computer. Here's the copy-paste from mine:
py> import timeit
py> timeit.timeit('big_data.sort(axis=0); big_data.mean(); big_data.max(); big_data.min()',
...               'import numpy; big_data = numpy.random.normal(10, size=1e8).reshape((1e4, 1e4)); print "random data made, starting..."',
...               number=1)
random data made, starting...
11.597978115081787
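
For reference, a Python 3 / current-NumPy version of the same benchmark might look like the sketch below (print as a function, integer sizes, and an explicit float32 cast so the array actually occupies 400 MB — `normal()` itself returns float64). The `bench` helper name is mine, not from the snippet above:

```python
import timeit


def bench(n, side):
    """Time sort/mean/max/min on an n-element float32 array, as in the
    transcript above.  n must equal side * side for the reshape to work."""
    setup = (
        "import numpy; "
        f"big_data = numpy.random.normal(10, size={n})"
        f".astype(numpy.float32).reshape(({side}, {side})); "
        "print('random data made, starting...')"
    )
    stmt = "big_data.sort(axis=0); big_data.mean(); big_data.max(); big_data.min()"
    # number=1: run the statement once, return elapsed seconds
    return timeit.timeit(stmt, setup, number=1)


# Full-size run from the email: bench(10**8, 10**4)
# (needs roughly 800 MB transiently, since normal() builds float64 first)
```

Shrink the arguments (e.g. `bench(10**4, 10**2)`) to try it quickly on a small machine.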
James
On Sep 12, 2012, at 8:32 AM, Jacob Keller wrote:
> Dear List,
>
> since this probably comes up a lot in manipulation of pdb/reflection files
> and so on, I was curious what people thought would be the best language for
> the following: I have some huge (100s MB) tables of tab-delimited data on
> which I would like to do some math (averaging, sigmas, simple arithmetic,
> etc) as well as some sorting and rejecting. It can be done in Excel, but this
> is exceedingly slow even in 64-bit, so I am looking to do it through some
> scripting. Just as an example, a "sort" which takes >10 min in Excel takes
> ~10 sec max with the unix command sort (seems crazy, no?). Any suggestions?
>
> Thanks, and sorry for being off-topic,
>
> Jacob
>
> --
> *******************************************
> Jacob Pearson Keller
> Northwestern University
> Medical Scientist Training Program
> email: [email protected]
> *******************************************
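
For what it's worth, Jacob's use case above (column means, sigmas, sorting, and rejection on a tab-delimited table) is also only a few lines of numpy. A minimal sketch — the inline table, the column choices, and the 2-sigma cutoff are all made up for illustration; with a real file you'd pass the filename to `numpy.loadtxt` directly:

```python
import io

import numpy

# Stand-in for a tab-delimited file on disk; for a real file use
# numpy.loadtxt("table.tsv", delimiter="\t") instead.
tsv = "1.0\t10.0\n2.0\t12.0\n3.0\t11.0\n4.0\t10.5\n5.0\t11.5\n6.0\t100.0\n"
data = numpy.loadtxt(io.StringIO(tsv), delimiter="\t")

means = data.mean(axis=0)             # per-column average
sigmas = data.std(axis=0)             # per-column sigma
ordered = data[data[:, 0].argsort()]  # rows sorted by column 0

# Reject rows whose column-1 value lies more than 2 sigma from that
# column's mean (an illustrative cutoff, not a recommendation).
keep = numpy.abs(data[:, 1] - means[1]) <= 2 * sigmas[1]
filtered = data[keep]
```

The same boolean-mask pattern handles any rejection rule, and everything stays in memory as one array, which is why it runs orders of magnitude faster than spreadsheet formulas.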