Re: [Numpy-discussion] record data previous to Numpy use

Robert Kern Thu, 06 Jul 2017 02:53:13 -0700

On Thu, Jul 6, 2017 at 1:49 AM, <[email protected]> wrote:
>
> Dear All
>
> First of all, thanks for the answers and the information (I’ll dig into
> it). Let me try to add some comments on what I want to do:
>
> My ASCII file mainly contains data (float and int) in a single column
> (that is not always the case, but I can easily manage it; I also saw that I
> can use the ‘split’ instruction if necessary).
> Comment/text lines indicate the beginning of a block, immediately followed by
> the number of sub-blocks.
> So I need to read/record all the values in order to build a matrix before
> working on it (using Numpy & vectorization).
>
> The columns 2 and 3 have been added for further treatments
> The ‘0’ values will be specifically treated afterward
>
>
> Numpy won’t be a problem, I guess (I did some basic tests and I’m quite
> confident about how to proceed), but I’m really blocked on reading the
> data. I am trying to find a way to efficiently read and record the data in
> a matrix:
>
> avoiding dynamic memory allocation (here meaning ‘append’ in the Python
> sense, not np),


Although you can avoid some list appending in your case (because the blocks
self-describe their length), I would caution you against prematurely
avoiding it. It's often the most natural way to write the code in Python,
so go ahead and write it that way first. If, once it is working correctly,
it turns out to be too slow or memory intensive, then you can puzzle over how
to preallocate the numpy arrays. But quite often, it's fine. In this
case, the reading and handling of the text data itself is probably the
bottleneck, not appending to the lists. As I said, Python lists are
cleverly implemented to make appending fast. Accumulating numbers in a list
then converting to an array afterwards is a well-accepted numpy idiom.
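
As a minimal sketch of that idiom (the helper name `read_values` and the
sample input are made up for illustration; they are not part of the
attachment): accumulate the parsed numbers in a plain Python list and convert
to an array once at the end.

```python
import io

import numpy as np


def read_values(lines):
    """Collect every whitespace-separated number in `lines` into a 1-D array.

    `lines` is any iterable of strings, e.g. an open text file.
    """
    values = []
    for line in lines:
        # A line may hold one value or several; `split()` handles both.
        values.extend(float(tok) for tok in line.split())
    # Convert once, after all appends are done.
    return np.array(values, dtype=float)


# Hypothetical in-memory input standing in for a real file.
sample = io.StringIO("1\n2.5\n3 4\n")
arr = read_values(sample)
print(arr.shape, arr.dtype)
```

The list grows as needed while reading, and numpy only gets involved once the
final length is known.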

> dealing with huge ASCII files: the latest file I got contains more than 60
> million lines
>
> Please find attached an extract of the input format
> (‘example_of_input’) and the matrix I’m trying to create and manage with
> Numpy
>
> Thanks again for your time

Try something like the attached. The function will return a list of blocks.
Each block will itself be a list of numpy arrays, which are the sub-blocks
themselves. I didn't bother adding the first three columns to the
sub-blocks or trying to assemble them all into a uniform-width matrix by
padding with trailing 0s. Since you say that the trailing 0s are going to
be "specially treated afterwards", I suspect that you can more easily work
with the lists of arrays instead. I assume floating-point data rather than
trying to figure out whether int or float from the data. The code can
handle multiple data values on one line (not especially well-tested, but it
ought to work), but it assumes that the number of sub-blocks, index of the
sub-block, and sub-block size are each on its own line. The code gets a
little more complicated if that's not the case.
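
If the uniform-width matrix does turn out to be needed, the padding can be
done afterwards. The following is only a sketch under my own assumptions: the
helper name `pad_subblocks` is hypothetical, it pads with trailing 0s up to
the longest sub-block, and it leaves out the three prefix columns.

```python
import numpy as np


def pad_subblocks(subblocks):
    """Stack 1-D arrays of differing lengths into one zero-padded 2-D array."""
    # Width of the matrix is the length of the longest sub-block.
    width = max((len(sb) for sb in subblocks), default=0)
    matrix = np.zeros((len(subblocks), width), dtype=float)
    for row, sb in zip(matrix, subblocks):
        # `row` is a view into `matrix`, so this fills the matrix in place.
        row[:len(sb)] = sb
    return matrix


# Hypothetical sub-blocks standing in for one block returned by the reader.
subblocks = [np.array([1.0, 2.0, 3.0]), np.array([4.0]), np.array([5.0, 6.0])]
matrix = pad_subblocks(subblocks)
print(matrix)
```

Note that after padding, a genuine 0 in the data can no longer be told apart
from a padding 0, which is one reason to keep the list of arrays around.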

--
Robert Kern

from __future__ import print_function

import numpy as np


def write_random_file(filename, n_blocks=42, n_elems=60*1000*1000):
    q, r = divmod(n_elems, n_blocks)
    block_lengths = [q] * n_blocks
    block_lengths[-1] += r
    with open(filename, 'w') as f:
        print('##BEGIN', file=f)
        print(n_blocks, file=f)
        for i, block_length in enumerate(block_lengths, 1):
            print(i, file=f)
            print(block_length, file=f)
            block = np.random.randint(0, 1000, size=block_length)
            for x in block:
                print(x, file=f)


def read_blocked_file(filename):
    blocks = []
    with open(filename, 'r') as f:
        # Loop over all blocks.
        while True:
            # Consume lines until the start of the next block.
            # Unfortunately, we can't use `for line in f:` because we need to
            # use `f.readline()` later.
            line = f.readline()
            found_block = True
            while '##BEGIN' not in line:
                line = f.readline()
                if line == '':
                    # We've reached the end of the file.
                    found_block = False
                    break
            if not found_block:
                # We iterated to the end of the file. Break out of the `while`
                # loop.
                break

            # Read the number of sub-blocks.
            # This assumes that it is on a line all by itself.
            n_subblocks = int(f.readline())
            subblocks = []
            for i_subblock in range(1, n_subblocks + 1):
                read_i_subblock = int(f.readline())
                # These ought to match.
                if read_i_subblock != i_subblock:
                    raise RuntimeError("Mismatched sub-block index")
                # Read the size of the sub-block.
                subblock_size = int(f.readline())
                # Allocate an array for the contents.
                subblock_data = np.empty(subblock_size, dtype=float)
                i = 0
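                while i < subblock_size:
                    line = f.readline()
                    if line == '':
                        # Guard against looping forever on a truncated file.
                        raise RuntimeError(
                            "Unexpected end of file inside a sub-block")
                    # If there are multiple values on the line, handle them
                    # all. Build a real list so that `len()` also works on
                    # Python 3, where `map()` returns an iterator.
                    values = [float(x) for x in line.split()]
                    # Check before assigning: an oversized slice assignment
                    # could fail with a less helpful error.
                    if i + len(values) > subblock_size:
                        raise RuntimeError(
                            "Oops! We read too many values for the sub-block!")
                    # Note that if there was only one value, then `values`
                    # will be a 1-element list.
                    subblock_data[i:i+len(values)] = values
                    i += len(values)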
                # Note that we don't have the prefix columns or the trailing
                # padding zeros. These can all be added later, if they are
                # really needed.
                subblocks.append(subblock_data)
            blocks.append(subblocks)
    return blocks

_______________________________________________
NumPy-Discussion mailing list
[email protected]
https://mail.python.org/mailman/listinfo/numpy-discussion
