I see that PyTables deals with HDF5 data. It would be very nice if the data
were in such a standard format, but that is not the case, and it cannot be
changed.

________________________________________
From: numpy-discussion-boun...@scipy.org [numpy-discussion-boun...@scipy.org] on
behalf of Frédéric Bastien [no...@nouiz.org]
Sent: Wednesday, 13 March 2013 15:03
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks

Hi,

I would suggest that you look at PyTables [1]. It uses a different file
format, but it seems to do exactly what you want, and it gives you an object
with an interface very similar to numpy.ndarray (though with fewer
functions). You just ask for the slice/indices that you want and it
returns a numpy.ndarray.

HTH

Frédéric

[1] http://www.pytables.org/moin
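
For illustration, a minimal sketch of the access pattern Frédéric describes
(the file name "data.h5" and node name "measurements" are made up here, and
the data would first have to be converted to HDF5, which the original poster
says is not possible in this case). Slicing a PyTables array node reads only
the requested elements from disk and hands back a plain numpy.ndarray:

    import numpy as np
    import tables

    # Write an existing array into a PyTables/HDF5 file once
    # (illustrative names, not from the thread):
    with tables.open_file("data.h5", mode="w") as h5:
        h5.create_array(h5.root, "measurements",
                        np.arange(10**6, dtype=np.uint32))

    # Read back every 20th element; only those elements are read from disk,
    # and the result is a regular numpy.ndarray:
    with tables.open_file("data.h5", mode="r") as h5:
        subsampled = h5.root.measurements[::20]
        print(type(subsampled), subsampled.shape)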

On Wed, Mar 13, 2013 at 9:54 AM, Nathaniel Smith <n...@pobox.com> wrote:
> On Wed, Mar 13, 2013 at 1:45 PM, Andrea Cimatoribus
> <andrea.cimatori...@nioz.nl> wrote:
>> Hi everybody, I hope this has not been discussed before; I couldn't find a
>> solution elsewhere.
>> I need to read some binary data, and I am using numpy.fromfile to do this.
>> Since the files are huge and would make me run out of memory, I need to
>> read the data while skipping some records (the data are recorded at high
>> frequency, so basically I want to subsample while reading).
>> At the moment, I came up with the code below, which is then compiled with
>> Cython. Despite the significant performance increase over the pure Python
>> version, the function is still much slower than numpy.fromfile, and it only
>> reads one kind of data (in this case uint32), since otherwise I do not know
>> how to define the array type in advance. I have basically no experience with
>> Cython or C, so I am a bit stuck. How can I make this more efficient and
>> possibly more generic?
>
> If your data is stored as fixed-format binary (as it seems it is),
> then the easiest way is probably
>
> # Exploit the operating system's virtual memory manager to get a
> # "virtual copy" of the entire file in memory.
> # (This does not actually use any memory until accessed):
> virtual_arr = np.memmap(path, np.uint32, "r")
> # Get a numpy view onto every 20th entry:
> virtual_arr_subsampled = virtual_arr[::20]
> # Copy those bits into regular malloc'ed memory:
> arr_subsampled = virtual_arr_subsampled.copy()
>
> (Your data is probably large enough that this will only work if you're
> using a 64-bit system, because of address space limitations; but if
> you have data that's too large to fit into memory, then I assume
> you're using a 64-bit system anyway...)
>
> -n
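
As a further sketch of the seek-and-read approach the original post describes
(the function name, step, and dtype below are illustrative placeholders, not
code from the thread), one can mix numpy.fromfile with explicit seeks. The
memmap approach above is simpler and usually faster, but this works without
mapping the whole file:

    import numpy as np

    def read_subsampled(path, dtype=np.uint32, step=20):
        # Read every `step`-th item of a flat binary file of `dtype` records.
        # Illustrative sketch only; assumes fixed-size records and no header.
        itemsize = np.dtype(dtype).itemsize
        values = []
        with open(path, "rb") as f:
            while True:
                item = np.fromfile(f, dtype=dtype, count=1)
                if item.size == 0:       # end of file reached
                    break
                values.append(item[0])
                f.seek((step - 1) * itemsize, 1)  # skip the next step-1 records
        return np.array(values, dtype=dtype)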
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
