On Wed, Dec 7, 2011 at 7:50 PM, Chris.Barker <[email protected]> wrote:
> Hi folks,
>
> This is a continuation of a conversation already started, but I gave it
> a new, more appropriate, thread and subject.
>
> On 12/6/11 2:13 PM, Wes McKinney wrote:
> > we should start talking
> > about building a *high performance* flat file loading solution with
> > good column type inference and sensible defaults, etc.
> ...
> > I personally don't
> > believe in sacrificing an order of magnitude of performance in the 90%
> > case for the 10% case -- so maybe it makes sense to have two functions
> > around: a superfast custom CSV reader for well-behaved data, and a
> > slower, but highly flexible, function like loadtable to fall back on.
>
> I've wanted this for ages, and have done some work towards it, but like
> others, only had the time for a my-use-case-specific solution. A few
> thoughts:
>
> * If we have a good, fast ascii (or unicode?) to array reader, hopefully
> it could be leveraged for use in the more complex cases. So that rather
> than genfromtxt() being written from scratch, it would be a wrapper
> around the lower-level reader.

You seem to be contradicting yourself here. The more complex cases are
Wes' 10%, and the reason genfromtxt is so hairy internally. There's
always a trade-off between speed and handling complex corner cases. You
want both. A very fast reader for well-behaved files would be very
welcome, but I see it as a separate topic from genfromtxt/loadtable. The
question for the loadtable pull request is whether it is different
enough from genfromtxt that we need/want both, or whether loadtable
should replace genfromtxt.

Cheers,
Ralf

> * Key to performance is to have the text-to-number-to-numpy-type
> conversion happening in C -- if you read the text with Python, then
> convert to numbers, then to numpy arrays, it's simply going to be slow.
>
> * I think we want a solution that can be adapted to arbitrary text files
> -- not just tabular, CSV-style data. I have a lot of those to read --
> and some thoughts about how.
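The performance point in that second bullet is easy to see with a rough
timing sketch (illustrative only -- the file size and the numbers in it
are arbitrary choices of mine): parse a file of floats at the Python
level, then let np.fromfile() do the text-to-float conversion in C.

```python
import tempfile
import timeit

import numpy as np

# Write ~100k whitespace-separated floats to a scratch file.
nums = " ".join(str(i * 0.5) for i in range(100_000))
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(nums)
    path = f.name

def python_level():
    # Read text in Python, convert each field with float(), then build the array.
    with open(path) as fh:
        return np.array([float(x) for x in fh.read().split()])

def c_level():
    # Text-to-float conversion happens inside numpy's C loop.
    return np.fromfile(path, sep=" ")

t_py = timeit.timeit(python_level, number=5)
t_c = timeit.timeit(c_level, number=5)
print(f"python-level: {t_py:.3f}s   C-level fromfile: {t_c:.3f}s")
```

Both paths produce identical arrays; the difference is purely where the
string-to-number conversion runs.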
>
> Efforts I have made so far, and what I've learned from them:
>
> 1) fromfile():
>
> fromfile() (for text) is nice and fast, but buggy, and a bit too
> limited. I've posted various notes about this in the past (and, I'm
> pretty sure, a couple of tickets). The key missing features are:
>
> a) no support for commented lines (this is a lesser need, I think)
>
> b) there can be only one delimiter, and newlines are treated as
> generic whitespace. What this means is that if you have a
> whitespace-delimited file, you can read multiple lines, but if it is,
> for instance, comma-delimited, then you can only read one line at a
> time, killing performance.
>
> c) there are various bugs if the text is malformed, or doesn't quite
> match what you're asking for (i.e. reading integers, but the text is
> float) -- mostly really limited error checking.
>
> I spent some time digging into the code, and found it to be really
> hard-to-track C code, and very hard to update. The core idea is pretty
> nice -- each dtype should know how to read itself from a text file --
> but the implementation is painful. The key issue is that for floats
> and ints, anyway, it relies on the C atoi() and atof() functions.
> However, there have been patches to these that handle NaN better,
> etc., for numpy, and I think a Python patch as well. So the code calls
> the numpy atoi, which does some checks, then calls the Python atoi,
> which then calls the C lib atoi (I think all that...). In any case,
> the core bugs are due to the fact that atoi and friends don't return
> an error code, so you have to check whether the pointer has been
> incremented to see if the read was successful -- and this error
> checking is not propagated through all those levels of calls. It got
> really ugly to try to fix! Also, the use of the C atoi() means that
> locales may only be handled in the default way -- i.e. no way to read
> European-style floats on a system with a US locale.
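Point (b) is easy to reproduce with np.fromfile's text mode (a small
sketch; the exact partial result may vary across numpy versions, but
the comma-delimited read stops at the first newline either way):

```python
import tempfile

import numpy as np

def read_back(text, sep):
    # Write `text` to a scratch file, then read it back with np.fromfile.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(text)
        name = f.name
    return np.fromfile(name, sep=sep)

# Whitespace-delimited: newlines count as generic whitespace,
# so both rows are read.
ws = read_back("1 2\n3 4", sep=" ")

# Comma-delimited: the newline doesn't match the separator,
# so reading stops after the first row.
csv = read_back("1,2\n3,4", sep=",")

print(ws, csv)
```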
>
> My conclusion -- the current code is too much of a mess to try to deal
> with and fix!
>
> I also think it's a mistake to have text file reading be a special
> case of fromfile(); it really should be a separate issue, though
> that's a minor API question.
>
> 2) FileScanner:
>
> FileScanner is some code I wrote years ago as a C extension -- it's
> limited, but does the job and is pretty fast. It essentially calls
> fscanf() as many times as it gets a successful scan, skipping all
> invalid text, then returning a numpy array. You can also specify how
> many numbers you want read from the file. It only supports floats.
> Travis O. asked if it could be included in SciPy way back when, but I
> suspect none of my code actually made it in.
>
> If I had to do it again, I might write something similar in Cython,
> though I am still using it.
>
>
> My conclusions:
>
> I think what we need is something similar to MATLAB's fscanf(): what
> it does is take a C-style format string, and apply it to your file
> over and over again as many times as it can, and return an array.
> What's nice about this is that it can be repurposed to efficiently
> read a wide variety of text files fast.
>
> For numpy, I imagine something like:
>
> fromtextfile(f, dtype=np.float64, comment=None, shape=None):
>     """
>     Read data from a text file, returning a numpy array.
>
>     f: a filename or file-like object
>
>     comment: a string of the comment signifier. Anything on a line
>              after this string will be ignored.
>
>     dtype: the numpy dtype you want read from the file
>
>     shape: the shape of the resulting array. If shape == None, the
>            file will be read until EOF or until there is a read error.
>            By default, if there are newlines in the file, a 2-d array
>            will be returned, with a newline signifying a new row in
>            the array.
>     """
>
> This is actually pretty straightforward.
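For concreteness, here is a hypothetical pure-Python reference sketch
of that proposal. The signature comes from the docstring above; the
implementation is mine, and is only there to pin down the semantics --
the real thing would do its parsing in C:

```python
import io

import numpy as np

def fromtextfile(f, dtype=np.float64, comment=None, shape=None):
    """Read data from a text file, returning a numpy array."""
    if isinstance(f, str):
        f = open(f)
    make = np.dtype(dtype).type            # scalar constructor for `dtype`
    rows = []
    for line in f:
        if comment is not None:
            line = line.split(comment, 1)[0]   # ignore everything after the marker
        fields = line.replace(",", " ").split()
        if fields:
            rows.append([make(x) for x in fields])
    # Newlines in the file signify rows, so more than one row gives a 2-d array.
    arr = np.array(rows if len(rows) > 1 else rows[0], dtype=dtype)
    return arr if shape is None else arr.reshape(shape)

print(fromtextfile(io.StringIO("1, 2, 3\n4, 5, 6  # a comment\n"), comment="#"))
```

The shape=None / 2-d-on-newlines behaviour falls out naturally, and the
same function covers the simple 1-d case.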
> If it supports compound dtypes, then you can read a pretty complex
> CSV file, once you've determined the dtype for your "record" (row).
> It is also really simple to use for the simple cases.
>
> But of course, the implementation could be a pain -- I've been
> thinking that you could get a lot of it by creating a mapping from
> numpy dtypes to fscanf() format strings, then simply use fscanf() for
> the actual file reading. This would certainly be easy for the easy
> cases. (Maybe you'd want to use sscanf(), so you could have the same
> code scan strings as well as files.)
>
> Ideally, each dtype would know how to read itself from a string, but
> as I said above, the code for that is currently pretty ugly, so it
> may be easier to keep it separate.
>
> Anyway, I'd be glad to help with this effort.
>
> -Chris
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959 voice
> 7600 Sand Point Way NE   (206) 526-6329 fax
> Seattle, WA 98115        (206) 526-6317 main reception
>
> [email protected]
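One shape that dtype-to-format-string mapping could take (a sketch of
mine; the table is illustrative and far from exhaustive, and real C
code would need platform-correct length modifiers):

```python
import numpy as np

# Map (dtype kind, itemsize) -> scanf conversion. Illustrative only.
SCANF_FORMATS = {
    ("i", 4): "%d",    # int32
    ("i", 8): "%lld",  # int64 (the length modifier is platform-dependent)
    ("f", 4): "%f",
    ("f", 8): "%lf",
}

def scanf_format(dtype, sep=","):
    """Build a scanf-style format string for one record of `dtype`."""
    dt = np.dtype(dtype)
    if dt.names:  # compound dtype: one conversion per field, sep between them
        return sep.join(SCANF_FORMATS[(dt[name].kind, dt[name].itemsize)]
                        for name in dt.names)
    return SCANF_FORMATS[(dt.kind, dt.itemsize)]

# A compound "record" dtype describing one CSV row:
record = np.dtype([("id", "i4"), ("x", "f8"), ("y", "f8")])
print(scanf_format(record))   # -> %d,%lf,%lf
```

A C loop would then apply that one format string with fscanf() (or
sscanf()) once per row, which is exactly the MATLAB-fscanf pattern
described above.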
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
