Hi,

Yes, working with binary formats is the way to go when you have large data. For future reference, though, Dask [1] fits your use case well. See below how I process a 7 GB text file in under 17 seconds (on a laptop: MacBook Pro, quad-core, SSD).

# Create roughly 7 GB worth of text data.

In [40]: import numpy as np

In [41]: x = np.random.random((60, 5000000))

In [42]: %time np.savetxt('data.txt', x)
CPU times: user 4min 28s, sys: 14.8 s, total: 4min 43s
Wall time: 5min

In [43]: %time y = np.loadtxt('data.txt')
CPU times: user 6min 31s, sys: 1min, total: 7min 31s
Wall time: 7min 44s

# Then we use dask to read the big file. The key here is to choose a
# block size so that the file is processed in ~120 MB chunks (approx. one line).
# By default, dask uses the line separator \n to ensure the partitions don't
# break the lines.

In [1]: import dask.bag

In [2]: data = dask.bag.read_text('data.txt', blocksize=120*1024*1024)

In [3]: data
dask.bag<bag-fro..., npartitions=60>

# Rather than passing the entire 100+ MB line to np.loadtxt, we slice the first
# 128 bytes, which is enough to grab the first 4 columns.
# You could speed this up further by not reading the entire line and instead
# reading just 128 bytes from each line offset.
# (to_array assumes numpy is already imported as np, as in the session above.)

In [4]: from io import StringIO

In [5]: def to_array(line):
   ...:     return np.loadtxt(StringIO(line[:128]))[:4]
   ...:

In [6]: %time y = np.asarray(data.map(to_array).compute())
CPU times: user 190 ms, sys: 60.8 ms, total: 251 ms
Wall time: 16.9 s

In [7]: y.shape
(60, 4)

In [8]: y[:2, :]
array([[ 0.17329305,  0.36584998,  0.01356046,  0.6814617 ],
       [ 0.3352684 ,  0.83274823,  0.24399607,  0.30103352]])

You can also use dask to convert the entire file to HDF5.

Regards,
Rolando

[1] http://dask.pydata.org/

On Wed, Nov 30, 2016 at 1:16 PM, Heli <heml...@gmail.com> wrote:

> Hi all,
>
> Writing my ASCII file once to either of pickle or npy or hdf data types
> and then working afterwards on the result binary file reduced the read time
> from 80(min) to 2 seconds.
>
> Thanks everyone for your help.
> --
> https://mail.python.org/mailman/listinfo/python-list
>
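
P.S. On the remark in the session about reading just 128 bytes from each line offset: here is a minimal sketch of that idea using only the standard library. I have not timed it on the 7 GB file. The helper names line_offsets and first_columns are made up for illustration, and the sketch assumes whitespace-separated values where the first 128 bytes of every line hold at least four complete fields (true for the default np.savetxt format used above).

```python
def line_offsets(path, chunk_size=1 << 20):
    """Return the byte offset where each line starts.

    One pass over the raw bytes looking for b"\\n"; nothing is decoded or parsed.
    (Hypothetical helper, not part of dask or numpy.)
    """
    offsets = [0]
    pos = 0  # number of bytes consumed so far
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            start = 0
            while True:
                idx = chunk.find(b"\n", start)
                if idx == -1:
                    break
                offsets.append(pos + idx + 1)  # next line starts after the newline
                start = idx + 1
            pos += len(chunk)
    # If the file ends with a newline, the last offset points at end-of-file.
    if offsets[-1] >= pos:
        offsets.pop()
    return offsets


def first_columns(path, offsets, ncols=4, nbytes=128):
    """Seek to each line start and parse the first `ncols` fields from `nbytes` bytes.

    Assumes `nbytes` is large enough to contain `ncols` complete fields.
    (Hypothetical helper, not part of dask or numpy.)
    """
    rows = []
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            head = f.read(nbytes).split(b"\n", 1)[0]  # never run into the next line
            rows.append([float(v) for v in head.split()[:ncols]])
    return rows
```

Something like first_columns('data.txt', line_offsets('data.txt')) would then return one list of four floats per line, which np.asarray can turn into the same (60, 4) array. If every line is known to have the same length, the offsets can be computed arithmetically and the scanning pass skipped altogether.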