Re: [Numpy-discussion] Efficient way to load a 1Gb file?

Anne Archibald Wed, 10 Aug 2011 13:02:07 -0700

There was also some work on a semi-mutable array type that allowed
appending along one axis, then 'freezing' to yield a normal numpy
array (unfortunately I'm not sure how to find it in the mailing list
archives). One could write such a setup by hand, using mmap() or
realloc(), but I'd be inclined to simply write a filter that converted
the text file to some sort of binary file on the fly, value by value.
Then the file can be loaded in or mmap()ed.  A 1 Gb text file is a
miserable object anyway, so it might be desirable to convert to (say)
HDF5 and then throw away the text file.
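
A minimal sketch of the text-to-binary filter described above, assuming a
regular table of floats with no header and no missing values. The function
names text_to_binary and open_converted and the raw float64 layout are
illustrative assumptions, not something from the thread or from numpy itself;
only one line of text is held in memory at a time.

```python
import struct

import numpy as np


def text_to_binary(text_path, bin_path, delimiter=None):
    """Stream a delimited text table to raw native-endian float64 values.

    Hypothetical helper: returns (n_rows, n_columns) so the binary file
    can later be mapped with the right shape.
    """
    n_rows = 0
    n_columns = None
    with open(text_path, "r") as src, open(bin_path, "wb") as dst:
        for line in src:
            if not line.strip():        # skip blank lines
                continue
            fields = line.split(delimiter)
            if n_columns is None:
                n_columns = len(fields)
            elif len(fields) != n_columns:
                raise ValueError("ragged row at data row %d" % (n_rows + 1))
            # Write one row of doubles; nothing else is kept in memory.
            dst.write(struct.pack("=%dd" % n_columns, *map(float, fields)))
            n_rows += 1
    return n_rows, (n_columns or 0)


def open_converted(bin_path, shape):
    """Map the converted file read-only instead of loading it into RAM."""
    return np.memmap(bin_path, dtype=np.float64, mode="r", shape=shape)
```

The shape has to be stored or passed along separately, since a raw binary
file carries no metadata; a self-describing container such as HDF5 avoids
that bookkeeping.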


Anne

On 10 August 2011 15:43, Derek Homeier
<de...@astro.physik.uni-goettingen.de> wrote:
> On 10 Aug 2011, at 19:22, Russell E. Owen wrote:
>
>> A coworker is trying to load a 1Gb text data file into a numpy array
>> using numpy.loadtxt, but he says it is using up all of his machine's 6Gb
>> of RAM. Is there a more efficient way to read such text data files?
>
> The npyio routines (loadtxt as well as genfromtxt) first read in the entire
> data as lists, which of course creates significant overhead, but is not easy
> to circumvent, since numpy arrays cannot be appended to efficiently - so you
> first have to store the numbers in some kind of growable object. One could
> write a custom parser that tries to be somewhat more efficient, e.g. first
> reading in sub-arrays from a smaller buffer. Concatenating those sub-arrays
> would still require about twice the memory of the final array. I don't know
> if using the array.array type (which can grow) is much more efficient than a
> list...
> To really avoid any excess memory usage you'd have to know the total data 
> size in advance - either by reading in the file in a first pass to count the 
> rows, or explicitly specifying it to a custom reader. Basically, assuming a 
> completely regular file without missing values etc., you could then read in 
> the data like
>
> import numpy as np
>
> X = np.zeros((n_lines, n_columns), dtype=float)
> delimiter = ' '
> with open(fname, 'r') as f:  # file() exists only in Python 2
>    for n, line in enumerate(f):
>        X[n] = np.array(line.split(delimiter), dtype=float)
>
> (adjust delimiter and dtype as needed...)
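
One way the two-pass variant mentioned above could look, as a sketch only:
the first pass counts the rows, the second fills a preallocated array. The
name load_two_pass is hypothetical, and the file is assumed to be completely
regular (same number of columns on every non-blank line, no header, no
missing values).

```python
import numpy as np


def load_two_pass(fname, delimiter=None):
    """Hypothetical helper: read a regular text table with one allocation."""
    # Pass 1: count data rows; take the column count from the first row.
    n_lines = 0
    n_columns = 0
    with open(fname, "r") as f:
        for line in f:
            if not line.strip():
                continue
            if n_lines == 0:
                n_columns = len(line.split(delimiter))
            n_lines += 1

    # Pass 2: allocate the final array once, then fill it row by row.
    X = np.empty((n_lines, n_columns), dtype=float)
    with open(fname, "r") as f:
        n = 0
        for line in f:
            if not line.strip():
                continue
            X[n] = [float(v) for v in line.split(delimiter)]
            n += 1
    return X
```

The cost is reading the file twice; in exchange the only large allocation is
the final array itself.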
>
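
On the array.array question, a single-pass sketch under the same
regular-file assumption; load_via_array is a hypothetical name. An
array('d') stores each value as a plain 8-byte double, whereas a list holds
a pointer to a separate Python float object per value, so the buffer should
be noticeably more compact, although it still over-allocates somewhat while
growing.

```python
from array import array

import numpy as np


def load_via_array(fname, delimiter=None):
    """Hypothetical helper: accumulate values in array('d'), then wrap them."""
    values = array("d")          # growable buffer of C doubles
    n_columns = None
    with open(fname, "r") as f:
        for line in f:
            if not line.strip():
                continue
            row = [float(v) for v in line.split(delimiter)]
            if n_columns is None:
                n_columns = len(row)
            elif len(row) != n_columns:
                raise ValueError("ragged row")
            values.extend(row)
    if n_columns is None:        # empty file
        return np.empty((0, 0), dtype=np.float64)
    # frombuffer wraps the existing memory instead of copying it.
    return np.frombuffer(values, dtype=np.float64).reshape(-1, n_columns)
```

Because the result shares memory with the array.array buffer, no second
copy of the data is made at the end.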
> HTH,
>                                                        Derek
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
