Anand Patil (on 2007-10-31 at 17:53:17 -0700) said::

> I have a file full of 32-bit floats, in binary format, compressed with zip.
> I'd like to get it into a PyTables array, but this:
> 
>     Z = ZipFile('data_file.zip')
>     binary_data = Z.read('data_file')
>     numpy_array = numpy.fromstring(binary_data, dtype=numpy.float32)
>     h5file.createArray('/', 'data', numpy_array)
> 
> won't work because I don't have enough memory for the intermediate stages.
> Is there an easy way to do this piece-by-piece or in a 'streaming' fashion?

First of all, I'd avoid using an ``Array`` object for storing such a big
array.  ``CArray`` or ``EArray`` objects are better suited for that:
since they are chunked, they are a lot more memory-efficient.  Both
allow you to store your data little by little, because disk space is
only allocated for a chunk when it is actually needed (see the sketch
just below).  The former has a fixed shape, while the latter is
enlargeable.
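
For instance, here is a minimal sketch (file and node names are made
up) of how cheap it is to declare a huge ``CArray`` before any data is
written:

    import numpy
    import tables

    h5file = tables.openFile('sketch.h5', 'w')
    atom = tables.Float32Atom()
    # A thousand million rows are *declared* here, but no disk space
    # is allocated for them yet.
    carray = h5file.createCArray('/', 'big', atom, shape=(1000000000,))
    # Space is allocated only now, and only for the chunks touched.
    carray[:1000] = numpy.arange(1000, dtype=numpy.float32)
    h5file.close()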

I guess the big obstacle will be extracting the data from the zip file
incrementally.  Since the ``ZipFile`` interface doesn't allow this (at
least up to Python 2.5; see the note at the end), you may unzip
``data_file`` to disk, then open it and read chunks of data from it.
Something like this:

    import os
    import subprocess

    import numpy
    import tables

    h5file = tables.openFile('data.h5', 'w')  # or however you open yours

    nptype = numpy.float32
    atom = tables.Atom.from_sctype(nptype)

    # Extract data_file from data_file.zip with an external unzip.
    subprocess.check_call(['unzip', '-o', 'data_file.zip', 'data_file'])
    # Get the total number of rows from the extracted file's size.
    total_rows = os.stat('data_file').st_size // atom.itemsize

    array = h5file.createCArray( '/', 'data', atom,
                                 shape=(total_rows,) )
    # ...or, if you prefer an enlargeable array:
    array = h5file.createEArray( '/', 'data', atom,
                                 shape=(0,), expectedrows=total_rows )
    # We will be reading blocks as big as a chunk, so each write
    # maps onto whole chunks.
    rows_to_read = array.chunkshape[0]
    bytes_to_read = rows_to_read * atom.itemsize

    dfile = open('data_file', 'rb')
    data = dfile.read(bytes_to_read)
    base = 0  # only needed in the CArray case
    while data:
        arr = numpy.fromstring(data, dtype=nptype)
        # CArray case: fill the preallocated slots...
        array[base:base+len(arr)] = arr
        base += len(arr)
        # ...or EArray case: enlarge the array instead.
        # (Keep only the branch matching the array created above.)
        array.append(arr)
        data = dfile.read(bytes_to_read)
    array.flush()
    dfile.close()

This is untested, but I hope you get the idea.
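
By the way, newer Python versions (2.6 and later) add a
``ZipFile.open()`` method that returns a file-like object which
decompresses on the fly, so the temporary extracted file could be
skipped altogether.  A sketch along the same lines (untested too,
``EArray`` case):

    import zipfile

    zfile = zipfile.ZipFile('data_file.zip')
    member = zfile.open('data_file')  # decompresses incrementally
    data = member.read(bytes_to_read)
    while data:
        array.append(numpy.fromstring(data, dtype=nptype))
        data = member.read(bytes_to_read)
    array.flush()
    zfile.close()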

Cheers,

::

        Ivan Vilata i Balaguer   >qo<   http://www.carabos.com/
               Cárabos Coop. V.  V  V   Enjoy Data
                                  ""
