On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker <[email protected]> wrote:
> Hi folks,
>
> This is a continuation of a conversation already started, but I gave it
> a new, more appropriate, thread and subject.
>
> On 12/6/11 2:13 PM, Wes McKinney wrote:
> > we should start talking
> > about building a *high performance* flat file loading solution with
> > good column type inference and sensible defaults, etc.
> ...
> > I personally don't
> > believe in sacrificing an order of magnitude of performance in the 90%
> > case for the 10% case -- so maybe it makes sense to have two functions
> > around: a superfast custom CSV reader for well-behaved data, and a
> > slower, but highly flexible, function like loadtable to fall back on.
>
> I've wanted this for ages, and have done some work towards it, but like
> others, only had the time for a my-use-case-specific solution. A few
> thoughts:
>
> * If we have a good, fast ascii (or unicode?) to array reader, hopefully
> it could be leveraged for use in the more complex cases. So that rather
> than genfromtxt() being written from scratch, it would be a wrapper
> around the lower-level reader.

You seem to be contradicting yourself here. The more complex cases are
Wes' 10%, and the reason genfromtxt is so hairy internally. There's
always a trade-off between speed and handling complex corner cases. You
want both. A very fast reader for well-behaved files would be very
welcome, but I see it as a separate topic from genfromtxt/loadtable. The
question for the loadtable pull request is whether it is different
enough from genfromtxt that we need/want both, or whether loadtable
should replace genfromtxt.

Cheers,
Ralf

> * Key to performance is to have the text-to-number-to-numpy-type
> conversion happening in C -- if you read the text with Python, then
> convert to numbers, then to numpy arrays, it's simply going to be slow.
>
> * I think we want a solution that can be adapted to arbitrary text files
> -- not just tabular, CSV-style data. I have a lot of those to read --
> and some thoughts about how.
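The performance point in that second bullet is easy to see with a rough
timing sketch (illustrative only -- the file size and the numbers in it
are arbitrary choices of mine): parse a file of floats at the Python
level, then let np.fromfile() do the text-to-float conversion in C.

```python
import tempfile
import timeit

import numpy as np

# Write ~100k whitespace-separated floats to a scratch file.
nums = " ".join(str(i * 0.5) for i in range(100_000))
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(nums)
    path = f.name

def python_level():
    # Read text in Python, convert each field with float(), then build the array.
    with open(path) as fh:
        return np.array([float(x) for x in fh.read().split()])

def c_level():
    # Text-to-float conversion happens inside numpy's C loop.
    return np.fromfile(path, sep=" ")

t_py = timeit.timeit(python_level, number=5)
t_c = timeit.timeit(c_level, number=5)
print(f"python-level: {t_py:.3f}s   C-level fromfile: {t_c:.3f}s")
```

Both paths produce identical arrays; the difference is purely where the
string-to-number conversion runs.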
>
> Efforts I have made so far, and what I've learned from them:
>
> 1) fromfile():
>
> fromfile() (for text) is nice and fast, but buggy, and a bit too
> limited. I've posted various notes about this in the past (and, I'm
> pretty sure, a couple of tickets). The key missing features are:
>
> a) no support for commented lines (this is a lesser need, I think)
>
> b) there can be only one delimiter, and newlines are treated as
> generic whitespace. What this means is that if you have a
> whitespace-delimited file, you can read multiple lines, but if it is,
> for instance, comma-delimited, then you can only read one line at a
> time, killing performance.
>
> c) there are various bugs if the text is malformed, or doesn't quite
> match what you're asking for (i.e. reading integers, but the text is
> float) -- mostly really limited error checking.
>
> I spent some time digging into the code, and found it to be really
> hard-to-track C code, and very hard to update. The core idea is pretty
> nice -- each dtype should know how to read itself from a text file --
> but the implementation is painful. The key issue is that for floats
> and ints, anyway, it relies on the C atoi() and atof() functions.
> However, there have been patches to these that handle NaN better,
> etc., for numpy, and I think a Python patch as well. So the code calls
> the numpy atoi, which does some checks, then calls the Python atoi,
> which then calls the C lib atoi (I think all that...). In any case,
> the core bugs are due to the fact that atoi and friends don't return
> an error code, so you have to check whether the pointer has been
> incremented to see if the read was successful -- and this error
> checking is not propagated through all those levels of calls. It got
> really ugly to try to fix! Also, the use of the C atoi() means that
> locales may only be handled in the default way -- i.e. no way to read
> European-style floats on a system with a US locale.
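Point (b) is easy to reproduce with np.fromfile's text mode (a small
sketch; the exact partial result may vary across numpy versions, but
the comma-delimited read stops at the first newline either way):

```python
import tempfile

import numpy as np

def read_back(text, sep):
    # Write `text` to a scratch file, then read it back with np.fromfile.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(text)
        name = f.name
    return np.fromfile(name, sep=sep)

# Whitespace-delimited: newlines count as generic whitespace,
# so both rows are read.
ws = read_back("1 2\n3 4", sep=" ")

# Comma-delimited: the newline doesn't match the separator,
# so reading stops after the first row.
csv = read_back("1,2\n3,4", sep=",")

print(ws, csv)
```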
>
> My conclusion -- the current code is too much of a mess to try to deal
> with and fix!
>
> I also think it's a mistake to have text file reading be a special
> case of fromfile(); it really should be a separate issue, though
> that's a minor API question.
>
> 2) FileScanner:
>
> FileScanner is some code I wrote years ago as a C extension -- it's
> limited, but does the job and is pretty fast. It essentially calls
> fscanf() as many times as it gets a successful scan, skipping all
> invalid text, then returning a numpy array. You can also specify how
> many numbers you want read from the file. It only supports floats.
> Travis O. asked if it could be included in SciPy way back when, but I
> suspect none of my code actually made it in.
>
> If I had to do it again, I might write something similar in Cython,
> though I am still using it.
>
>
> My conclusions:
>
> I think what we need is something similar to MATLAB's fscanf(): what
> it does is take a C-style format string, and apply it to your file
> over and over again as many times as it can, and return an array.
> What's nice about this is that it can be repurposed to efficiently
> read a wide variety of text files fast.
>
> For numpy, I imagine something like:
>
> fromtextfile(f, dtype=np.float64, comment=None, shape=None):
>     """
>     Read data from a text file, returning a numpy array.
>
>     f: a filename or file-like object
>
>     comment: a string of the comment signifier. Anything on a line
>              after this string will be ignored.
>
>     dtype: the numpy dtype you want read from the file
>
>     shape: the shape of the resulting array. If shape == None, the
>            file will be read until EOF or until there is a read error.
>            By default, if there are newlines in the file, a 2-d array
>            will be returned, with a newline signifying a new row in
>            the array.
>     """
>
> This is actually pretty straightforward.
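For concreteness, here is a hypothetical pure-Python reference sketch
of that proposal. The signature comes from the docstring above; the
implementation is mine, and is only there to pin down the semantics --
the real thing would do its parsing in C:

```python
import io

import numpy as np

def fromtextfile(f, dtype=np.float64, comment=None, shape=None):
    """Read data from a text file, returning a numpy array."""
    if isinstance(f, str):
        f = open(f)
    make = np.dtype(dtype).type            # scalar constructor for `dtype`
    rows = []
    for line in f:
        if comment is not None:
            line = line.split(comment, 1)[0]   # ignore everything after the marker
        fields = line.replace(",", " ").split()
        if fields:
            rows.append([make(x) for x in fields])
    # Newlines in the file signify rows, so more than one row gives a 2-d array.
    arr = np.array(rows if len(rows) > 1 else rows[0], dtype=dtype)
    return arr if shape is None else arr.reshape(shape)

print(fromtextfile(io.StringIO("1, 2, 3\n4, 5, 6  # a comment\n"), comment="#"))
```

The shape=None / 2-d-on-newlines behaviour falls out naturally, and the
same function covers the simple 1-d case.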
> If it supports compound dtypes, then you can read a pretty complex
> CSV file, once you've determined the dtype for your "record" (row).
> It is also really simple to use for the simple cases.
>
> But of course, the implementation could be a pain -- I've been
> thinking that you could get a lot of it by creating a mapping from
> numpy dtypes to fscanf() format strings, then simply use fscanf() for
> the actual file reading. This would certainly be easy for the easy
> cases. (Maybe you'd want to use sscanf(), so you could have the same
> code scan strings as well as files.)
>
> Ideally, each dtype would know how to read itself from a string, but
> as I said above, the code for that is currently pretty ugly, so it
> may be easier to keep it separate.
>
> Anyway, I'd be glad to help with this effort.
>
> -Chris
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959 voice
> 7600 Sand Point Way NE   (206) 526-6329 fax
> Seattle, WA 98115        (206) 526-6317 main reception
>
> [email protected]
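One shape that dtype-to-format-string mapping could take (a sketch of
mine; the table is illustrative and far from exhaustive, and real C
code would need platform-correct length modifiers):

```python
import numpy as np

# Map (dtype kind, itemsize) -> scanf conversion. Illustrative only.
SCANF_FORMATS = {
    ("i", 4): "%d",    # int32
    ("i", 8): "%lld",  # int64 (the length modifier is platform-dependent)
    ("f", 4): "%f",
    ("f", 8): "%lf",
}

def scanf_format(dtype, sep=","):
    """Build a scanf-style format string for one record of `dtype`."""
    dt = np.dtype(dtype)
    if dt.names:  # compound dtype: one conversion per field, sep between them
        return sep.join(SCANF_FORMATS[(dt[name].kind, dt[name].itemsize)]
                        for name in dt.names)
    return SCANF_FORMATS[(dt.kind, dt.itemsize)]

# A compound "record" dtype describing one CSV row:
record = np.dtype([("id", "i4"), ("x", "f8"), ("y", "f8")])
print(scanf_format(record))   # -> %d,%lf,%lf
```

A C loop would then apply that one format string with fscanf() (or
sscanf()) once per row, which is exactly the MATLAB-fscanf pattern
described above.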
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
