On Thu, Jul 6, 2017 at 1:49 AM, <paul.carr...@free.fr> wrote:
>
> Dear All
>
> First of all, thanks for the answers and the information (I’ll dig into it). Let me try to add comments on what I want to do:
>
> My ASCII file mainly contains data (float and int) in a single column
> (this is not always the case, but I can easily manage it; I also saw that I can use the ‘split’ instruction if necessary)
> Comments/text indicate the beginning of a block, immediately followed by the number of sub-blocks
> So I need to read/record all the values in order to build a matrix before working on it (using Numpy & vectorization)
>
> Columns 2 and 3 have been added for further treatment
> The ‘0’ values will be specifically treated afterward
>
>
> Numpy won’t be a problem, I guess (I did some basic tests and I’m quite confident about how to proceed), but I’m really blocked on data records … I’m trying to find a way to efficiently read and record data in a matrix:
>
> avoiding dynamic memory allocation (here using ‘append’ in the Python sense, not np),

Although you can avoid some list appending in your case (because the blocks self-describe their length), I would caution you against prematurely avoiding it. It's often the most natural way to write the code in Python, so go ahead and write it that way first. If, once you get it working correctly, it turns out to be too slow or memory-intensive, then you can puzzle over how to preallocate the numpy arrays. But quite often, it's fine. In this case, the reading and handling of the text data itself is probably the bottleneck, not appending to the lists. As I said, Python lists are cleverly implemented to make appending fast. Accumulating numbers in a list and then converting to an array afterwards is a well-accepted numpy idiom.

> dealing with huge ASCII files: the latest file I got contains more than 60 million lines
>
> Please find attached an extract of the input format (‘example_of_input’), and the matrix I’m trying to create and manage with Numpy
>
> Thanks again for your time

Try something like the attached. The function will return a list of blocks. Each block will itself be a list of numpy arrays, which are the sub-blocks themselves. I didn't bother adding the first three columns to the sub-blocks or trying to assemble them all into a uniform-width matrix by padding with trailing 0s. Since you say that the trailing 0s are going to be "specially treated afterwards", I suspect that you can more easily work with the lists of arrays instead. I assume floating-point data rather than trying to figure out whether each value is an int or a float. The code can handle multiple data values on one line (not especially well-tested, but it ought to work), but it assumes that the number of sub-blocks, the index of the sub-block, and the sub-block size are each on their own line. The code gets a little more complicated if that's not the case.

--
Robert Kern

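
For illustration, the append-then-convert idiom described above might look like the following minimal sketch. The function name `read_values` and the in-memory sample are hypothetical stand-ins, not part of the attachment, and every token is assumed to parse as a float.

```python
import io

import numpy as np


def read_values(lines):
    """Accumulate floats in a plain Python list, then convert once."""
    values = []
    for line in lines:
        # A line may hold one or several whitespace-separated numbers.
        values.extend(float(tok) for tok in line.split())
    # A single conversion at the end; no array preallocation needed.
    return np.array(values, dtype=float)


# Hypothetical in-memory stand-in for a small data file.
sample = io.StringIO("1.5\n2\n3.25 4\n")
arr = read_values(sample)
print(arr)
```

The list grows as needed while reading, and the numpy array is only created once the final length is known.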
from __future__ import print_function

import numpy as np


def write_random_file(filename, n_blocks=42, n_elems=60*1000*1000):
    # Split the elements as evenly as possible; the last block gets the
    # remainder.
    q, r = divmod(n_elems, n_blocks)
    block_lengths = [q] * n_blocks
    block_lengths[-1] += r
    with open(filename, 'w') as f:
        print('##BEGIN', file=f)
        print(n_blocks, file=f)
        for i, block_length in enumerate(block_lengths, 1):
            print(i, file=f)
            print(block_length, file=f)
            block = np.random.randint(0, 1000, size=block_length)
            for x in block:
                print(x, file=f)


def read_blocked_file(filename):
    blocks = []
    with open(filename, 'r') as f:
        # Loop over all blocks.
        while True:
            # Consume lines until the start of the next block.
            # Unfortunately, we can't use `for line in f:` because we need to
            # use `f.readline()` later.
            line = f.readline()
            found_block = True
            while '##BEGIN' not in line:
                line = f.readline()
                if line == '':
                    # We've reached the end of the file.
                    found_block = False
                    break
            if not found_block:
                # We iterated to the end of the file. Break out of the `while`
                # loop.
                break

            # Read the number of sub-blocks.
            # This assumes that it is on a line all by itself.
            n_subblocks = int(f.readline())
            subblocks = []
            for i_subblock in range(1, n_subblocks + 1):
                read_i_subblock = int(f.readline())
                # These ought to match.
                if read_i_subblock != i_subblock:
                    raise RuntimeError("Mismatched sub-block index")
                # Read the size of the sub-block.
                subblock_size = int(f.readline())
                # Allocate an array for the contents.
                subblock_data = np.empty(subblock_size, dtype=float)
                i = 0
                # Keep reading until the sub-block is full. An empty
                # sub-block consumes no data lines.
                while i < subblock_size:
                    line = f.readline()
                    if line == '':
                        # Without this check, a truncated file would make
                        # the loop spin forever.
                        raise RuntimeError(
                            "Unexpected end of file inside a sub-block")
                    # If there are multiple values on the line, handle them
                    # intelligently. Build a real list so that `len()` also
                    # works on Python 3, where `map()` returns an iterator.
                    values = [float(x) for x in line.split()]
                    # Note that if there was only one value, then `values`
                    # will be a 1-element list.
                    if i + len(values) > subblock_size:
                        raise RuntimeError(
                            "Oops! We read too many values for the sub-block!")
                    subblock_data[i:i+len(values)] = values
                    i += len(values)
                # Note that we don't have the prefix columns or the trailing
                # padding zeros. These can all be added later, if they are
                # really needed.
                subblocks.append(subblock_data)
            blocks.append(subblocks)
    return blocks
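
If a uniform-width matrix padded with trailing 0s is needed after all, one possible way to assemble it from a list of sub-block arrays is sketched below. The helper name `pad_subblocks` is hypothetical and not part of the attachment; the sketch assumes a non-empty list of 1-D float arrays and leaves out the prefix columns.

```python
import numpy as np


def pad_subblocks(subblocks, fill=0.0):
    """Stack 1-D arrays of unequal length into one zero-padded 2-D array."""
    # Assumes `subblocks` is non-empty; `max()` fails on an empty list.
    width = max(len(sb) for sb in subblocks)
    matrix = np.full((len(subblocks), width), fill, dtype=float)
    for row, sb in enumerate(subblocks):
        # Copy the data; the rest of the row keeps the fill value.
        matrix[row, :len(sb)] = sb
    return matrix


# Hypothetical sub-blocks of different lengths.
subblocks = [np.array([1.0, 2.0, 3.0]), np.array([4.0]), np.array([5.0, 6.0])]
matrix = pad_subblocks(subblocks)
print(matrix)
```

Padding this way makes genuine 0 values in the data indistinguishable from padding, which is one reason to keep working with the lists of arrays where possible.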

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion