Re: [Numpy-discussion] Efficient way to load a 1Gb file?

Chris Barker Thu, 11 Aug 2011 21:49:45 -0700

On 8/10/2011 1:01 PM, Anne Archibald wrote:

There was also some work on a semi-mutable array type that allowed
appending along one axis, then 'freezing' to yield a normal numpy
array (unfortunately I'm not sure how to find it in the mailing list
archives).

That was me, and here is the thread -- however, I'm on vacation, and don't have the test code, etc. with me, but I found the core class. It's enclosed.

The npyio routines (loadtxt as well as genfromtxt) first read in the entire 
data as lists, which creates of course significant overhead, but is not easy to 
circumvent, since numpy arrays are immutable - so you have to first store the 
numbers in some kind of mutable object. One could write a custom parser that 
tries to be somewhat more efficient, e.g. first reading in sub-arrays from a 
smaller buffer. Concatenating those sub-arrays would still require about twice 
the memory of the final array. I don't know if using the array.array type 
(which is mutable) is much more efficient than a list...

Indeed, and you are holding all the text as well, which is generally going to be bigger than the resulting numbers.
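
For what it's worth, here is a rough sketch of the "sub-arrays from a smaller buffer" idea quoted above. It assumes a whitespace-delimited file with the same number of float columns on every row. The name load_in_chunks and the chunk_rows parameter are made up for illustration; they are not part of numpy. Only chunk_rows lines of text are held at once, but the final concatenate still needs roughly twice the memory of the result for a moment.

```python
import io
import itertools

import numpy as np


def load_in_chunks(fileobj, chunk_rows=10000, dtype=np.float64):
    """Parse whitespace-delimited numeric text a block of rows at a time.

    Hypothetical helper for illustration only, not a numpy routine.
    """
    chunks = []
    while True:
        # Hold at most chunk_rows lines of text in memory at once.
        lines = list(itertools.islice(fileobj, chunk_rows))
        if not lines:
            break
        # Skip blank lines, split the rest into string fields.
        rows = [line.split() for line in lines if line.strip()]
        if rows:
            chunks.append(np.array(rows, dtype=dtype))
    if not chunks:
        return np.empty((0,), dtype=dtype)
    # This step still needs the chunks plus the result in memory.
    return np.concatenate(chunks)


# Small in-memory stand-in for a real file.
demo = io.StringIO("1 2 3\n4 5 6\n7 8 9\n10 11 12\n")
data = load_in_chunks(demo, chunk_rows=3)
print(data.shape)
```

With a real file you would pass an open file object instead of the StringIO stand-in.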

Interesting: when I wrote accumulator, I found that it didn't, for the most part, have any performance advantage over accumulating in lists and then converting to arrays -- but there is a memory advantage, so this may be a good use case. You could do something like (untested):


If your rows are all one dtype:

X = accumulator(dtype=np.float32, block_shape = (num_cols,))

if they are not, then build a custom dtype to hold the rows, and use that:

dt = np.dtype('%id'%num_cols) # create a dtype that holds a row:
                              # num_cols doubles in this case.

# create an accumulator for that dtype
X = accumulator(dtype=dt)

# loop through the file to build the array:
delimiter = ' '
for line in open(fname, 'r'):
     X.append( np.array(line.split(delimiter), dtype=float) )


X = np.array(X) # gives a regular old array as a copy

I note that converting to a regular array requires a data copy, which, if memory is tight, might not be good. The solution would be to have a way to make a view, so you'd get a regular array from the same data (with maybe the extra buffer space).
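
To make the copy-versus-view distinction concrete, here is a small sketch using a plain numpy array as a stand-in for the accumulator's padded internal buffer. The names buffer, length, as_copy and as_view are only for illustration, and this is not how the enclosed class currently behaves (it always hands out copies).

```python
import numpy as np

# A padded buffer like the one the accumulator keeps internally:
# 8 slots allocated, only the first 5 hold real data.
buffer = np.zeros(8, dtype=np.float64)
length = 5
buffer[:length] = [1.0, 2.0, 3.0, 4.0, 5.0]

as_copy = np.array(buffer[:length])  # new memory, independent of buffer
as_view = buffer[:length]            # same memory, padding still allocated

print(np.shares_memory(buffer, as_copy))  # False
print(np.shares_memory(buffer, as_view))  # True

# Writing through the view changes the buffer; the copy is unaffected.
as_view[0] = 99.0
print(buffer[0], as_copy[0])
```

The catch with the view is that once it exists, the buffer can no longer be safely re-allocated in place, which is why the class returns copies for now.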

I'd like to see this class get more mature, robust, and better performing, but so far it's worked for my use cases. Contributions welcome.


-Chris



--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov

#!/usr/bin/env python

"""
accumulator class

Designed to be used as an expandable numpy array, to accumulate values, rather
than a python list.

Note that slices return copies, rather than views, unlike regular numpy arrays.
This is so that the buffer can be re-allocated without messing up any views.

Only works for 1-d arrays at the moment.

"""
import numpy as np

class accumulator:
    #A few parameters
    DEFAULT_BUFFER_SIZE = 128
    BUFFER_EXTEND_SIZE = 1.25 # array.array uses 1+1/16 -- that seems small to me.
    def __init__(self, object=None, dtype=np.float64, block_shape=()):
        """
        proper docs here
        
        note: a scalar accumulator doesn't really make sense, so you get a
        length-1 array instead.

        block_shape specifies the other dimensions of the array, so that it
        will be of shape:
          (n,) + block_shape
        block_shape is ignored if object is provided, and the shape of the array
        is determined from the shape of the provided object.
        
        If neither object nor block_shape is provided, an empty 1-d array is
        created
        """
        if object is None:
            buffer = np.empty((0,)+block_shape, dtype = dtype)
        else:
            buffer = np.array(object, dtype=dtype, copy=True)
            if buffer.shape == ():# to make sure we don't have a scalar
                buffer.shape = (1,)
        self._length      = buffer.shape[0]
        self._block_shape = buffer.shape[1:]
        ## add the padding to the buffer
        # the buffer length must be an integer, so truncate the scaled size
        shape = ( max(self.DEFAULT_BUFFER_SIZE,
                      int(buffer.shape[0]*self.BUFFER_EXTEND_SIZE)), ) + buffer.shape[1:]
        buffer.resize( shape )
        self.__buffer = buffer

        
    ##fixme: 
    ## using @property seems to give a getter, but setting then overrides it
    ## which seems terribly prone to error.
    @property
    def dtype(self):
        return self.__buffer.dtype

    @property
    def bufferlength(self):
        """
        the size of the internal buffer
        """
        return self.__buffer.shape[0]

    @property
    def shape(self):
        """
        To be compatible with ndarray.shape
        (only the getter!) 
        """
        return (self._length,) + self._block_shape
    
    def __len__(self):
        return self._length
        
    def __array__(self, dtype=None):
        """
        a.__array__(|dtype) -> copy of array.
    
        Always returns a copy of the array, so that buffer doesn't have any
        references to it.
        """
        return np.array(self.__buffer[:self._length], dtype=dtype, copy=True)

    def append(self, item):
        """
        add a new item to the end of the array.
        
        It should be one less dimension than the array: i.e. a.shape[1:]
        if the item is a smaller shape, it needs to be broadcastable to that
        shape.
        """
        try:
            self.__buffer[self._length] = item
            self._length += 1
        except IndexError: # the buffer is not big enough or wrong shape entries
            #fixme: test for wrong shape?
            # integer size, and always at least one slot bigger than the data
            self.resize(int(self._length*self.BUFFER_EXTEND_SIZE) + 1)
            self.append(item)

    def extend(self, items):
        """
        add a sequence of new items to the end of the array
        """
        try:
            self.__buffer[self._length:self._length+len(items)] = items
            self._length += len(items)
        except ValueError: # the buffer is not big enough, or wrong shape
            items = np.asarray(items, dtype=self.dtype)
            if items.shape[1:] != self._block_shape:
                raise
            self.resize( int((self._length+len(items)) * self.BUFFER_EXTEND_SIZE) )
            self.extend(items)

    def resize(self, newsize):
        """
        resize the internal buffer
        
        it takes a scalar for the length of the first axis.
        
        You might want to do this to speed things up if you know you want it
        to be a lot bigger eventually
        """
        if newsize < self._length:
            raise ValueError("accumulator buffer cannot be made smaller than the data")
        shape = (int(newsize),) + self._block_shape
        self.__buffer.resize(shape)

    def fitbuffer(self):
        """
        re-sizes the buffer so that it fits the data, rather than having extra
        space

        """
        self.__buffer.resize( (self._length,) + self._block_shape )
        
    def __getitem__(self, index):
        ## fixme -- this needs to be expanded to n-d!
        if index < 0:
            index += self._length # negative indices count from the end of the data
        if index < 0 or index > self._length-1:
            raise IndexError("index out of bounds")
        return self.__buffer[index]
    
    ## apparently __getslice__ is deprecated!
    def __getslice__(self, i, j):
        """
        a.__getslice__(i, j) <==> a[i:j]
    
        Use of negative indices is not supported.
        
        This returns a COPY, not a view, unlike numpy arrays
        This is required as the data buffer needs to be able to change.
        """
        #fixme -- this needs to be updated: it should be in __getitem__
        #         and support 2d
        j  = min(j, self._length)
        return self.__buffer[i:j].copy()
    def __delitem__(self, index):
        raise NotImplementedError
    def __eq__(self, other):
        return self.__buffer[:self._length] == other
    ## fixme: other comparison method here 
    def __str__(self):
        return self.__buffer[:self.shape[0]].__str__()
    def __repr__(self):
        return "accumulator%s"%self.__buffer[:self.shape[0]].__repr__()[5:]
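
For a feel of what the BUFFER_EXTEND_SIZE choice costs, here is a back-of-the-envelope sketch that counts how many re-allocations it takes to reach a given number of items. It assumes the buffer starts at DEFAULT_BUFFER_SIZE and grows to int(size * factor) each time it fills. count_resizes is a name invented here, and this is a simplified model rather than a measurement of the class above.

```python
def count_resizes(n_items, start=128, factor=1.25):
    """Count buffer re-allocations needed to hold n_items.

    Assumes the buffer grows to int(size * factor) every time it fills;
    this is a simplified model, not a measurement of the class above.
    """
    size = start
    resizes = 0
    while size < n_items:
        # Always grow by at least one slot so the loop terminates.
        size = max(int(size * factor), size + 1)
        resizes += 1
    return resizes


# Compare array.array's 1+1/16, the class default, and plain doubling.
for factor in (1.0625, 1.25, 2.0):
    print(factor, count_resizes(1000000, factor=factor))
```

A larger factor means fewer re-allocations but more unused padding at the end, so the choice is a trade-off between the two.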

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
