On 10 Aug 2011, at 19:22, Russell E. Owen wrote:

> A coworker is trying to load a 1Gb text data file into a numpy array 
> using numpy.loadtxt, but he says it is using up all of his machine's 6Gb 
> of RAM. Is there a more efficient way to read such text data files?

The npyio routines (loadtxt as well as genfromtxt) first read the entire 
data into lists, which of course creates significant overhead. That is not easy to 
circumvent, since numpy arrays cannot be grown in place - so you have to first store the 
numbers in some kind of resizable container. One could write a custom parser that 
tries to be somewhat more efficient, e.g. by first reading in sub-arrays from a 
smaller buffer (a sketch of that approach follows below). Concatenating those sub-arrays 
would still require about twice the memory of the final array. I don't know if using 
the array.array type (which is resizable) is much more efficient than a list...
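
As an illustration, a minimal sketch of such a chunked reader could look like the 
following; the function name, the chunk_lines parameter and its default are just 
placeholders, and it assumes simple float data with a uniform delimiter and no 
missing values:

import numpy as np
from itertools import islice

def read_chunked(fname, chunk_lines=100000, delimiter=None, dtype=float):
    # Parse the file in blocks of chunk_lines rows and concatenate the
    # resulting sub-arrays; peak memory is still roughly twice the result,
    # but only chunk_lines rows at a time exist as Python lists.
    chunks = []
    with open(fname, 'r') as f:
        while True:
            lines = list(islice(f, chunk_lines))
            if not lines:
                break
            chunks.append(np.array([line.split(delimiter) for line in lines],
                                   dtype=dtype))
    return np.concatenate(chunks)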
To really avoid any excess memory usage you'd have to know the total data size 
in advance - either by reading the file in a first pass just to count the rows, 
or by specifying it explicitly to a custom reader. Basically, assuming a 
completely regular file without missing values etc., you could then read in the 
data like 

import numpy as np

X = np.zeros((n_lines, n_columns), dtype=float)   # shape known in advance
delimiter = ' '
with open(fname, 'r') as f:
    for n, line in enumerate(f):                  # fill the preallocated array row by row
        X[n] = np.array(line.split(delimiter), dtype=float)

(adjust delimiter and dtype as needed...)
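
For the first of the two options - counting the rows in a preliminary pass - 
something along these lines should work (again just a sketch, assuming every line 
has the same number of columns; it provides the n_lines and n_columns used above):

delimiter = ' '
with open(fname, 'r') as f:
    n_columns = len(f.readline().split(delimiter))   # columns from the first line
    n_lines = 1 + sum(1 for _ in f)                  # count remaining rows without storing them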

HTH,
                                                        Derek
