On 20 Mar 2012, at 14:40, Chao YUE wrote:

> I would agree, thanks!
> I used gawk to split the file into many files by year, so it will be
> easier to handle.
> Anyway, it's not good practice to produce such huge txt files...

Indeed it's not, but it's also not good practice to load the entire content 
of a text file into memory as a Python list, which unfortunately is what all 
the numpy readers still do. But this has been discussed on this list and 
improvements are under way. 
For your problem at hand, the textreader that Warren Weckesser recently 
announced - I can't find the post right now, but the code is at

https://github.com/WarrenWeckesser/textreader

- might be helpful. It is still under construction, but for a plain csv file 
such as yours it should already work. And since the text parsing is 
implemented in C, it should also give you a huge speedup on your 1/2 GB!

In addition to the profiling David suggested, it would certainly be a good 
idea to read the file in smaller chunks and write them directly to the 
netCDF file (see the sketch at the end of this mail). Note that you can 
already read a single line at a time with the likes of

import numpy as np
from StringIO import StringIO  # Python 2; use io.StringIO on Python 3

f = open('file.txt', 'r')
# parse only the next line of the file as a one-row record
np.genfromtxt(StringIO(f.next()), delimiter=',')

but I don't think this would work with textreader, and looping over single 
lines in Python like that would defeat the point of using a fast reader 
anyway. Since your actual data should come to less than 1 GB as a numpy 
array, memory usage with textreader should not be critical either.
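
To illustrate the chunked approach mentioned above, here is a rough sketch 
of how one could parse the csv in chunks with genfromtxt and append each 
chunk to a netCDF variable. It assumes the netCDF4-python package and a 
purely numeric file with a fixed number of columns; the chunk size, file 
names and variable names are just placeholders you would adapt to your data.

import itertools
import numpy as np
from netCDF4 import Dataset   # netCDF4-python, assumed to be installed

chunk_size = 100000   # lines parsed per iteration; tune to your memory budget
ncols = 10            # placeholder: number of columns in your csv

nc = Dataset('data.nc', 'w')
nc.createDimension('row', None)             # unlimited, grows as we append
nc.createDimension('col', ncols)
var = nc.createVariable('data', 'f8', ('row', 'col'))

f = open('file.txt', 'r')
start = 0
while True:
    # grab the next chunk_size lines without loading the whole file
    lines = list(itertools.islice(f, chunk_size))
    if not lines:
        break
    # genfromtxt accepts any sequence of lines, not just file objects
    chunk = np.atleast_2d(np.genfromtxt(lines, delimiter=','))
    var[start:start + chunk.shape[0], :] = chunk
    start += chunk.shape[0]
f.close()
nc.close()

This way only chunk_size parsed rows sit in memory at any time, and the 
netCDF file simply grows along its unlimited dimension as each slice is 
written.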

Cheers,
                                Derek

