On Jan 7, 2010, at 2:32 PM, josef.p...@gmail.com wrote:

> On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker
> <chris.bar...@noaa.gov> wrote:
>> Pauli Virtanen wrote:
>>> Mon, 2010-01-04 at 17:05 -0800, Christopher Barker wrote:
>>>> it also does odd things with spaces
>>>> embedded in the separator:
>>>>
>>>> ", $ #" matches all of: ",$#" ", $#" ",$ #"
>>
>>> That's a documented feature:
>>
>> Fair enough.
>>
>> OK, I've written a patch that allows newlines to be interpreted as
>> separators in addition to whatever is specified in sep.
>>
>> In the process of testing, I found again these issues, which are
>> still marked as "needs decision".
>>
>> http://projects.scipy.org/numpy/ticket/883
>>
>> In short: what to do with missing values?
>>
>> I'd like to address this bug, but I need a decision to do so.
>>
>>
>> My proposal:
>>
>> Raise a ValueError with missing values.
>>
>>
>> Justification:
>>
>> No function should EVER return data that is not there. Period. It is
>> simply asking for hard to find bugs. Therefore:
>>
>> fromstring("3, 4,,5", sep=",")
>>
>> Should never, ever, return:
>>
>> array([ 3., 4., 0., 5.])
>>
>> Which is what it does now. bad. bad. bad.
>>
>>
>> Alternatives:
>>
>> A) Raising a ValueError is the easiest way to get "proper" behavior.
>> Folks can use a more sophisticated file reader if they want missing
>> values handled. I'm willing to contribute this patch.
>>
>> B) If the dtype is a floating point type, NaN could fill in the
>> missing values -- a fine idea, but you can't use it for integers, and
>> zero is a really bad replacement!
>>
>> C) The user could specify what they want filled in for missing
>> values. This is a fine idea, though I'm not sure I want to take the
>> time to implement it.
>>
>> Oh, and this is a bug too, with probably the same solution:
>>
>> In [20]: np.fromstring("hjba", sep=',')
>> Out[20]: array([ 0.])
>>
>> In [26]: np.fromstring("34gytf39", sep=',')
>> Out[26]: array([ 34.])
>>
>>
>> One more unresolved question:
>>
>> what should:
>>
>> np.fromstring("3, 4, 5,", sep=",")
>>
>> return?
>>
>> it currently returns:
>>
>> array([ 3., 4., 5.])
>>
>> which seems a bit inconsistent with missing value handling. I also
>> found a bug:
>>
>> In [6]: np.fromstring("3, 4, 5 , ", sep=",")
>> Out[6]: array([ 3., 4., 5., 0.])
>>
>> so if there is some extra whitespace in there, it does return a
>> missing value. With my proposal, that wouldn't happen, but you might
>> get an exception. I think you should, but it'll be easier to
>> implement my "allow newlines" code if not.
>>
>>
>> so, should I do (A) ?
>>
>>
>> Another question:
>>
>> I've got a patch mostly working (except for the above issues) that
>> will allow fromfile/string to read multiline non-whitespace separated
>> data in one shot:
>>
>>
>> In [15]: str
>> Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>>
>> In [16]: np.fromstring(str, sep=',', allow_newlines=True)
>> Out[16]:
>> array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.,
>> 12.])
>>
>>
>> I think this is a very helpful enhancement, and, as it is a new
>> kwarg, backward compatible:
>>
>> 1) Might it be accepted for inclusion?
>>
>> 2) Is the name for the flag OK: "allow_newlines"? It's pretty
>> explicit, but also long -- I used it for the flag name in the C code,
>> too.
>>
>> 3) What C datatype should I use for a boolean flag? I used a char,
>> but I don't know what the numpy standard is.
>>
>>
>> -Chris
>>
>>
>
> I don't know much about this, just a few more test cases
>
> comma and newline
> str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12'
>
> extra comma at end of file
> str = '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,'
>
> extra newlines at end of file
> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>
> It would be nice if these cases would go through without missing
> values or exception, but I don't often have files that are clean
> enough for fromfile().
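
To make the cases quoted above concrete, here is a minimal pure-Python sketch of one way they could behave. It is not numpy's C implementation and not the patch under discussion; the function name parse_separated and its missing/fill parameters are hypothetical, made up for illustration. It assumes a non-whitespace separator, treats newlines as extra separators, tolerates one trailing separator per line, and lets the caller choose between options (A), (B) and (C) for missing values.

```python
import math


def parse_separated(text, sep=",", missing="raise", fill=None):
    """Hypothetical stand-in for the behaviour discussed, not numpy code.

    missing="raise" -> option (A): ValueError on an empty field
    missing="nan"   -> option (B): NaN for an empty field (floats only)
    missing="fill"  -> option (C): caller-supplied `fill` value
    """
    if missing not in ("raise", "nan", "fill"):
        raise ValueError("unknown missing policy: %r" % (missing,))

    values = []
    for line in text.split("\n"):      # newlines act as extra separators
        line = line.strip()
        if not line:
            continue                   # blank lines contribute no fields
        if line.endswith(sep):
            line = line[:-len(sep)]    # tolerate one trailing separator
        for token in line.split(sep):
            token = token.strip()
            if token:
                # float() raises ValueError on junk such as "hjba"
                values.append(float(token))
            elif missing == "raise":
                raise ValueError("missing value in line %r" % (line,))
            elif missing == "nan":
                values.append(math.nan)
            else:
                values.append(fill)
    return values


print(parse_separated("1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,\n\n\n"))
print(parse_separated("3, 4,,5", missing="nan"))   # [3.0, 4.0, nan, 5.0]
```

With this sketch all three test strings above parse to the twelve values with no missing entries, "3, 4, 5 , " gives three values, and "3, 4,,5" either raises or yields a NaN depending on the chosen policy. Whether a trailing separator should be tolerated at all is exactly the open question, so treat that line as one possible answer.
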

+1 (ignoring new-lines transparently is a nice feature). You can also
use sscanf with weave to read most files.

>
> I'm in favor of nan for missing values with floating point numbers. It
> would make it easy to read correctly formatted csv files, even if the
> data is not complete.

+1 (much preferable to insert NaN or other user value than raise
ValueError in my opinion)

-Travis

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion