On Wed, Mar 13, 2013 at 1:45 PM, Andrea Cimatoribus <andrea.cimatori...@nioz.nl> wrote:
> Hi everybody, I hope this has not been discussed before; I couldn't find a
> solution elsewhere.
> I need to read some binary data, and I am using numpy.fromfile to do this.
> Since the files are huge and would make me run out of memory, I need to
> skip some records while reading (the data are recorded at high frequency,
> so basically I want to subsample them as I read).
> At the moment I have come up with the code below, which is then compiled
> with Cython. Despite the significant performance increase over the pure
> Python version, the function is still much slower than numpy.fromfile, and
> it only reads one kind of data (in this case uint32); otherwise I do not
> know how to define the array type in advance. I have basically no
> experience with Cython or C, so I am a bit stuck. How can I make this more
> efficient and possibly more generic?
If your data is stored as fixed-format binary (as it seems it is), then the easiest way is probably

# Exploit the operating system's virtual memory manager to get a
# "virtual copy" of the entire file in memory
# (this does not actually use any memory until accessed):
virtual_arr = np.memmap(path, np.uint32, "r")
# Get a numpy view onto every 20th entry:
virtual_arr_subsampled = virtual_arr[::20]
# Copy those bits into regular malloc'ed memory:
arr_subsampled = virtual_arr_subsampled.copy()

(Your data is probably large enough that this will only work if you're using a 64-bit system, because of address space limitations; but if you have data that's too large to fit into memory, then I assume you're using a 64-bit system anyway...)

-n
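[Editorial note: for completeness, below is a minimal, self-contained sketch of the memmap approach described above. The file name, record dtype, and subsampling step are placeholders chosen for illustration and assume a raw file of fixed-size records with no header; adjust them to the actual file layout.]

import numpy as np

# Hypothetical parameters -- adjust to match the real file.
path = "data.bin"     # raw binary file of fixed-size records (placeholder name)
dtype = np.uint32     # assumed on-disk record type
step = 20             # keep every 20th record

# Map the whole file into virtual memory; nothing is read from disk yet.
virtual_arr = np.memmap(path, dtype=dtype, mode="r")

# Strided view onto the memmap; still no data is read.
virtual_view = virtual_arr[::step]

# Copying the view forces only the touched pages to be read and
# yields an ordinary in-memory ndarray.
arr_subsampled = virtual_view.copy()

print(arr_subsampled.shape, arr_subsampled.dtype)

[On a small test file one can sanity-check the result against a full read, e.g. np.array_equal(arr_subsampled, np.fromfile(path, dtype=dtype)[::step]); for the real, huge files that full read is exactly what the memmap route avoids. If the file starts with a header, np.memmap's offset argument can be used to skip it.]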