On Wed, Mar 13, 2013 at 1:45 PM, Andrea Cimatoribus
<andrea.cimatori...@nioz.nl> wrote:
> Hi everybody, I hope this has not been discussed before; I couldn't find a
> solution elsewhere.
> I need to read some binary data, and I am using numpy.fromfile to do this.
> Since the files are huge and would make me run out of memory, I need to read
> the data while skipping some records (the data are recorded at high frequency,
> so basically I want to subsample while reading).
> At the moment, I came up with the code below, which is then compiled with
> Cython. Despite the significant performance increase over the pure Python
> version, the function is still much slower than numpy.fromfile, and it only
> reads one kind of data (in this case uint32); otherwise I do not know how to
> define the array type in advance. I have basically no experience with Cython
> or C, so I am a bit stuck. How can I make this more efficient and
> possibly more generic?

If your data is stored as fixed-format binary (as it seems to be),
then the easiest approach is probably:

import numpy as np

# Exploit the operating system's virtual memory manager to get a
# "virtual copy" of the entire file in memory.
# (This does not actually use any memory until accessed.)
virtual_arr = np.memmap(path, dtype=np.uint32, mode="r")
# Get a numpy view onto every 20th entry:
virtual_arr_subsampled = virtual_arr[::20]
# Copy those bits into regular malloc'ed memory:
arr_subsampled = virtual_arr_subsampled.copy()
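
A side note on the "only reads one kind of data" concern: np.memmap also
accepts structured dtypes, so mixed-type records can be handled the same
way. A minimal sketch, assuming a hypothetical record layout of a uint32
timestamp followed by a float32 value (the field names and types are
placeholders; adjust them to your actual format):

# Hypothetical record layout; field names and types are placeholders
# for whatever your file actually contains:
record_dtype = np.dtype([("time", np.uint32), ("value", np.float32)])
virtual_records = np.memmap(path, dtype=record_dtype, mode="r")
# Same trick: a view onto every 20th record, then a copy into real memory:
records_subsampled = virtual_records[::20].copy()
# Individual fields are then available by name:
times = records_subsampled["time"]
values = records_subsampled["value"]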

(Your data is probably large enough that this will only work if you're
using a 64-bit system, because of address space limitations; but if
you have data that's too large to fit into memory, then I assume
you're using a 64-bit system anyway...)
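
If you did end up on a 32-bit system (or just want to avoid mapping the
whole file at once), a chunked fallback built on the numpy.fromfile call
you're already using would also work. This is only a sketch; the function
name, chunk size, and dtype are assumptions:

import numpy as np

def read_subsampled(path, dtype=np.uint32, step=20, chunk=10**6):
    # chunk must be a multiple of step so that per-chunk slicing
    # lines up with a global [::step] subsampling
    assert chunk % step == 0
    pieces = []
    with open(path, "rb") as f:
        while True:
            # Read at most `chunk` items; fromfile advances the file position
            block = np.fromfile(f, dtype=dtype, count=chunk)
            if block.size == 0:
                break
            pieces.append(block[::step])
    return np.concatenate(pieces) if pieces else np.empty(0, dtype=dtype)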

-n
