On Thu, Jul 8, 2010 at 9:26 AM, Hannes Bretschneider <[email protected]> wrote:
> Dear NumPy developers,
>
> I have to process some big data files with high-frequency
> financial data. I am trying to load a delimited text file having
> ~700 MB with ~10 million lines using numpy.genfromtxt(). The
> machine is a Debian Lenny server 32bit with 3GB of memory. Since
> the file is just 700MB I am naively assuming that it should fit
> into memory in whole. However, when I attempt to load it, python
> fills the entire available memory and then fails with
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/lib/python2.6/site-packages/numpy/lib/io.py", line 1318, in genfromtxt
>     errmsg = "\n".join(errmsg)
> MemoryError
>
> Is there a way to load this file without crashing?
>
> Thanks, Hannes
From my experience I might suggest using PyTables (HDF5) as intermediate storage for the data, which can be populated iteratively (you'll have to parse the data yourself; marking missing data could be a problem). This of course requires that you know the column schema ahead of time, which is one thing that np.genfromtxt will handle automatically. Particularly if you have a large static data set this can be worthwhile, as reading the data out of HDF5 will be many times faster than parsing the text file.

I believe you can also append rows to the PyTables Table structure in chunks, which would be faster than appending one row at a time. A rough sketch is below.

hth,
Wes
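
A rough, untested sketch of the chunked PyTables approach described above. The file names, column layout, and chunk size are invented for illustration (adjust the description to match the actual data), and it uses the PyTables 2.x-era API (tables.openFile / createTable):

    import tables

    # Illustrative column schema; pos= fixes the column order so plain
    # tuples can be appended in that order.
    class Tick(tables.IsDescription):
        timestamp = tables.Int64Col(pos=0)
        price     = tables.Float64Col(pos=1)
        volume    = tables.Int64Col(pos=2)

    CHUNK = 100000  # number of parsed rows to buffer before each append

    h5 = tables.openFile("ticks.h5", mode="w")
    table = h5.createTable("/", "ticks", Tick, "tick data")

    buf = []
    for line in open("bigfile.csv"):
        # Parse one delimited record yourself (no missing-data handling here).
        ts, price, vol = line.rstrip("\n").split(",")
        buf.append((int(ts), float(price), int(vol)))
        if len(buf) >= CHUNK:
            table.append(buf)   # append a whole chunk at once
            buf = []
    if buf:
        table.append(buf)
    table.flush()
    h5.close()

    # Reading the HDF5 file back later is much faster than re-parsing the text;
    # slicing the Table returns a structured NumPy array (which must fit in memory).
    h5 = tables.openFile("ticks.h5", mode="r")
    data = h5.root.ticks[:]
    h5.close()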
