Thank you so much. Your solution works! I greatly appreciate your help.
sturlamolden wrote:
> oyekomova wrote:
> >
> > Thanks for your note. I have 1Gig of RAM. Also, Matlab has no problem
> > in reading the file into memory. I am just running Istvan's code that
> > was posted earlier.
>
> You have a CSV file of about 520 MiB, which is read into memory. Then
> you have a list of lists of floats, created by list comprehension, which
> is larger than 274 MiB. Additionally you try to allocate a NumPy array
> slightly larger than 274 MiB. Now your process is already exceeding 1
> GiB, and you are probably running other processes too. That is why you
> run out of memory.
>
> So you have three options:
>
> 1. Buy more RAM.
>
> 2. Low-level code a csv-reader in C.
>
> 3. Read the data in chunks. That would mean something like this:
>
> import time, csv, random
> import numpy
>
> def make_data(rows=6E6, cols=6):
>     fp = open('data.txt', 'wt')
>     counter = range(cols)
>     for row in xrange( int(rows) ):
>         vals = map(str, [ random.random() for x in counter ] )
>         fp.write( '%s\n' % ','.join( vals ) )
>     fp.close()
>
> def read_test():
>     start = time.clock()
>     arrlist = None
>     r = 0
>     CHUNK_SIZE_HINT = 4096 * 4 # seems to be good
>     fid = file('data.txt')
>     while 1:
>         chunk = fid.readlines(CHUNK_SIZE_HINT)
>         if not chunk: break
>         reader = csv.reader(chunk)
>         data = [ map(float, row) for row in reader ]
>         arrlist = [ numpy.array(data,dtype=float), arrlist ]
>         r += arrlist[0].shape[0]
>         del data
>         del reader
>         del chunk
>     print 'Created list of chunks, elapsed time so far: ', time.clock() - start
>     print 'Joining list...'
>     data = numpy.empty((r,arrlist[0].shape[1]),dtype=float)
>     r1 = r
>     while arrlist:
>         r0 = r1 - arrlist[0].shape[0]
>         data[r0:r1,:] = arrlist[0]
>         r1 = r0
>         del arrlist[0]
>         arrlist = arrlist[0]
>     print 'Elapsed time:', time.clock() - start
>
> make_data()
> read_test()
>
> This can process a CSV file of 6 million rows in about 150 seconds on
> my laptop. A CSV file of 1 million rows takes about 25 seconds.
>
> Just reading the 6 million row CSV file ( using fid.readlines() ) takes
> about 40 seconds on my laptop. Python lists are not particularly
> efficient. You can probably reduce the time to ~60 seconds by writing a
> new CSV reader for NumPy arrays in a C extension.
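For anyone reading this thread later: the 274 MiB figure above is simply
6e6 rows x 6 columns x 8 bytes per float64, about 274.7 MiB. Below is a
minimal Python 3 sketch of the same chunked-read idea, assuming NumPy is
available; it uses numpy.vstack to join the chunks instead of the
hand-rolled nested list, and read_chunked is just an illustrative name,
not anything from the code quoted above.

import csv
import time

import numpy

def read_chunked(filename, chunk_size_hint=4096 * 4):
    """Read a CSV file of floats in chunks and stack them into one array."""
    start = time.perf_counter()
    chunks = []
    with open(filename) as fid:
        while True:
            # readlines() with a size hint returns roughly that many bytes
            # worth of complete lines, or an empty list at end of file.
            lines = fid.readlines(chunk_size_hint)
            if not lines:
                break
            reader = csv.reader(lines)
            chunks.append(numpy.array([[float(v) for v in row] for row in reader],
                                      dtype=float))
    # Join the per-chunk arrays into one big array (assumes a non-empty file).
    data = numpy.vstack(chunks)
    print('Elapsed time:', time.perf_counter() - start)
    return data

data = read_chunked('data.txt')

Like the original, joining the chunks still needs roughly twice the final
array's memory for a moment; the gain is only that the whole 520 MiB of CSV
text never has to sit in memory at once.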