Excerpts from Wes McKinney's message of Thu Feb 23 16:07:04 -0500 2012:

> That's pretty good. That's almost certainly faster than pandas's
> csv-module+Cython approach (though I haven't run your code to get a
> read on how much my hardware makes a difference), but that's not
> shocking at all:
>
> In [1]: df = DataFrame(np.random.randn(350000, 32))
>
> In [2]: df.to_csv('/home/wesm/tmp/foo.csv')
>
> In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
> CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
> Wall time: 7.04 s
>
> I have to think that skipping the process of creating 11.2 million
> Python string objects and then individually converting each of them
> to float accounts for most of the difference.
>
> Note for reference (I'm skipping the first row, which has the column
> labels from above):
>
> In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv',
> dtype=None, delimiter=',', skip_header=1)
> CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
> Wall time: 24.67 s
>
> In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv',
> delimiter=',', skiprows=1)
> CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
> Wall time: 11.32 s
>
> In this last case, for example, around 500 MB of RAM is taken up for
> an array that should only be about 80-90 MB. If you're a data
> scientist working in Python, this is _not good_.
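[Editor's note: the quoted figures are easy to check with a back-of-envelope sketch (not from the thread): a 350,000 x 32 frame of float64 values has 11.2 million cells, each of which becomes a Python string under the naive parsing strategy, and the parsed array itself needs about 90 MB, making loadtxt's ~500 MB peak roughly a 5x overhead.]

```python
import numpy as np

# A 350,000-row, 32-column frame of float64, as in the quoted benchmark.
rows, cols = 350_000, 32

# One Python string per cell under csv-module-style parsing --
# this matches the "11.2 million" figure in the post.
n_cells = rows * cols                 # 11,200,000

# Raw storage for the parsed float64 array: 8 bytes per value.
raw_mb = rows * cols * 8 / 1e6        # 89.6 MB, i.e. the "80-90 MB"

# ndarray.nbytes agrees with the formula on a smaller array.
small = np.zeros((1000, cols))
assert small.nbytes == 1000 * cols * 8

print(n_cells, raw_mb)                # 11200000 89.6
```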
It might be good to compare on recarrays, which are a bit more complex.
Can you try one of these .dat files?

    http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/

The dtype is

    [('ra', 'f8'), ('dec', 'f8'), ('g1', 'f8'), ('g2', 'f8'),
     ('err', 'f8'), ('scinv', 'f8', 27)]

--
Erin Scott Sheldon
Brookhaven National Laboratory
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
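[Editor's note: a minimal sketch of the kind of round trip being asked for, assuming the quoted dtype and made-up values rather than the actual .dat files: build a small structured array with the 27-element 'scinv' subfield, write it as flat text, and recover the records by viewing the plain 2-D result with the structured dtype.]

```python
import io
import numpy as np

# The dtype Erin quotes, including the 27-element vector field.
dtype = [('ra', 'f8'), ('dec', 'f8'), ('g1', 'f8'), ('g2', 'f8'),
         ('err', 'f8'), ('scinv', 'f8', 27)]

# Five made-up records (values are illustrative only).
arr = np.zeros(5, dtype=dtype)
arr['ra'] = np.linspace(0.0, 1.0, 5)
arr['scinv'] = np.arange(5 * 27, dtype='f8').reshape(5, 27)

# Each record flattens to 5 + 27 = 32 float64 text columns per row.
flat = np.column_stack([arr['ra'], arr['dec'], arr['g1'],
                        arr['g2'], arr['err'], arr['scinv']])
buf = io.StringIO()
np.savetxt(buf, flat)

# Read back as a plain (5, 32) array; viewing each 256-byte row with
# the structured dtype recovers one record per row.
back = np.loadtxt(io.StringIO(buf.getvalue()))
rec = back.view(dtype).reshape(-1)

assert rec['scinv'].shape == (5, 27)
assert np.allclose(rec['ra'], arr['ra'])
```

The point of the comparison in the thread is exactly the flattening step: a text reader sees 32 anonymous columns, and reconstructing the subarray field is extra work a recarray-aware reader would do for you.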