On 12/13/2011 12:08 PM, Chris Barker wrote:
NOTE:
Let's keep this on the list.
On Tue, Dec 13, 2011 at 9:19 AM, denis <[email protected]
<mailto:[email protected]>> wrote:
Chris,
unified, consistent save / load is a nice goal
1) header lines with date, pwd etc.: "where'd this come from ?"
# (5, 5) svm.py bz/py/ml/svm 2011-12-13 Dec 11:56 -- automatic
# 80.6 % correct -- user info
245 39 4 5 26
...
I'm not sure I understand what you are expecting here: what would be
automatic? If it parses a datetime in the header, what would it do with
it? But anyway, this seems to me:
- very application specific -- this is for the users code to write
- not what we are talking about at this point anyway -- I think this
discussion is about a lower-level, does-the-simple-things-fast reader
-- that may or may not be able to form the basis of a higher-level
fuller featured reader.
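For illustration, a minimal sketch of what a self-describing save along the lines of denis's header example might look like -- the function name, header fields, and file name here are all invented for the sketch, not part of any proposed API:

```python
import time
import numpy as np

def save_with_header(fname, arr, user_info=""):
    # Hypothetical helper: prepend an automatic provenance header
    # (shape + timestamp), plus an optional user line, before the data.
    with open(fname, "w") as f:
        f.write("# %s %s -- automatic\n"
                % (arr.shape, time.strftime("%Y-%m-%d %H:%M")))
        if user_info:
            f.write("# %s -- user info\n" % user_info)
        np.savetxt(f, arr, fmt="%g")

save_with_header("confusion.txt", np.eye(5) * 245,
                 user_info="80.6 % correct")
# The "#" comment lines are skipped by np.loadtxt on the way back in.
```

Since the header lines are ordinary "#" comments, the file round-trips through np.loadtxt unchanged.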
2) read any CSVs: comma or blank-delimited, with/without column names,
a la loadcsv() below
yup -- though the column name reading would be part of a higher-level
reader as far as I'm concerned.
3) sparse or masked arrays ?
sparse probably not, that seems pretty domain dependent to me -- though
hopefully one could build such a thing on top of the lower-level
reader. Masked support would be good -- once we're convinced what the
future of masked arrays is in numpy. I was thinking that the masked
array issue would really be a higher-level feature -- it certainly
could be if you need to mask "special value" style files (i.e. 9999),
but we may have to build it into the lower-level reader for cases
where the mask is specified by non-numerical values -- i.e. there are
some met files that use "MM" or some other text, so you can't put it
into a numerical array first.
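For what it's worth, genfromtxt can already map a text sentinel like "MM" straight to a masked element, so no intermediate numerical pass is needed -- a small sketch (the data here is made up):

```python
from io import StringIO
import numpy as np

# Met-style data where the text sentinel "MM" marks a missing value.
data = StringIO("1.0 2.0 MM\n4.0 MM 6.0\n")

# missing_values tells genfromtxt which strings count as missing;
# usemask=True returns a masked array instead of raising on "MM".
arr = np.genfromtxt(data, missing_values="MM", usemask=True)
```

Here arr.mask flags the two "MM" cells; a fast lower-level reader would need an equivalent hook to handle such files at all.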
Longterm wishes: beyond the scope of one file <-> one array
but essential for larger projects:
1) dicts / dotdicts:
Dotdict( A=anysizearray, N=scalar ... ) <-> a directory of little
files
is easy, better than np.savez
(Haven't used hdf5, I believe Matlabv7 does.)
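A minimal sketch of the "dict of arrays <-> directory of little files" idea -- the function names and directory name are invented for illustration, and this is just one way it could be laid out:

```python
import os
import numpy as np

def save_dotdict(dirname, **arrays):
    # Hypothetical sketch: one .npy file per array, so the "dict"
    # stays inspectable with plain ls, unlike a single .npz archive.
    os.makedirs(dirname, exist_ok=True)
    for name, value in arrays.items():
        np.save(os.path.join(dirname, name + ".npy"), np.asarray(value))

def load_dotdict(dirname):
    # Load the directory back into a plain dict keyed by file stem.
    return {fn[:-4]: np.load(os.path.join(dirname, fn))
            for fn in os.listdir(dirname) if fn.endswith(".npy")}

save_dotdict("run1", A=np.ones((3, 4)), N=0.5)
d = load_dotdict("run1")
```

Each entry is an ordinary .npy file, so individual arrays can be updated or inspected without rewriting the whole archive.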
2) workflows: has anyone there used visTrails ?
outside of the spec of this thread...
Anyway it seems to me (old grey cynic) that Numpy/scipy developers
prefer to code first, spec and doc later. Too pessimistic ?
Well, I think many of us believe in a more agile style approach --
incremental development. But really, as an open source project, it's
really about scratching an itch -- so there is usually a spec in mind
for the itch at hand. In this case, however, that has been a weakness
-- clearly a number of us have written small solutions to
our particular problem at hand, but so far we haven't arrived at a more
general-purpose solution. So a bit of spec-ing ahead of time may
be called for.
On that:
I've been thinking from the bottom up -- imagining what I need for the
simple case, and how it might apply to more complex cases -- but maybe
we should think about this another way:
What we're talking about here is really about core software
engineering -- optimization. It's easy to write a pure-python simple
file parser, and reasonable to write a complex one (genfromtxt) -- the
issue is performance -- we need some more C (or Cython) code to really
speed it up, but none of us wants to write the complex case code in C. So:
genfromtxt is really nice for many of the complex cases. So perhaps
another approach is to look at genfromtxt, and see what
high performance lower-level functionality we could develop that could
make it fast -- then we are done.
This actually mirrors exactly what we all usually recommend for python
development in general -- write it in Python, then, if it's really not
fast enough, write the bottleneck in C.
So where are the bottlenecks in genfromtxt? Are there self-contained
portions that could be re-written in C/Cython?
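One way to locate those bottlenecks is simply to profile a genfromtxt call on synthetic input -- a sketch (input sizes are arbitrary):

```python
import cProfile
import pstats
from io import StringIO
import numpy as np

# Synthetic comma-delimited input, 5000 rows x 10 columns.
text = "\n".join(",".join(str(i + j) for j in range(10))
                 for i in range(5000))

prof = cProfile.Profile()
prof.enable()
arr = np.genfromtxt(StringIO(text), delimiter=",")
prof.disable()

# The heaviest entries are the per-line Python work -- splitting
# and converting fields -- i.e. the candidates for C/Cython.
pstats.Stats(prof).sort_stats("cumulative").print_stats(10)
```

The per-row split/convert loop that dominates such a profile is exactly the self-contained portion one might try rewriting.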
-Chris
Reading data is hard and writing code that suits the diversity in the
Numerical Python community is even harder!
Both the loadtxt and genfromtxt functions (other functions are perhaps
less important) need an upgrade to incorporate the new NA object. I
think that adding the NA object will simplify some of the process because
invalid data (missing, or a string in a numerical field) can be set to
NA without requiring the creation of a new masked array or returning an
error.
Here I think loadtxt is a better target than genfromtxt because, as I
understand it, it assumes the user really knows the data, whereas
genfromtxt can infer the appropriate format from the data.
So I agree that a new 'superfast custom CSV reader for well-behaved data'
function would be rather useful, especially as a replacement for
loadtxt. By that I mean reading data using a user-specified format that
essentially follows the CSV format
(http://en.wikipedia.org/wiki/Comma-separated_values) -- it needs to
allow for the NA object, skipping lines, and user-defined delimiters.
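A pure-Python stand-in showing just the semantics of such a reader -- the function name and parameters are invented, and nan stands in for the NA object since that was still experimental at the time; a real version would of course live in C/Cython:

```python
import numpy as np

def read_simple_csv(fname, delimiter=",", skiprows=0,
                    na_strings=("NA", "")):
    # Sketch of the proposed interface: user-specified delimiter,
    # skipped leading lines, and NA markers mapped to nan.
    rows = []
    with open(fname) as f:
        for _ in range(skiprows):
            next(f)
        for line in f:
            fields = line.strip().split(delimiter)
            rows.append([np.nan if v in na_strings else float(v)
                         for v in fields])
    return np.array(rows)

with open("obs.csv", "w") as f:
    f.write("station,temp,wind\n1,2.5,NA\n2,NA,6.0\n")
out = read_simple_csv("obs.csv", skiprows=1)
```

The skiprows=1 call drops the header line, and the two "NA" cells come back as nan in a plain float array.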
Bruce
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion