On Thu, Jan 7, 2010 at 11:10 PM, Bruce Southey <bsout...@gmail.com> wrote:
> On Thu, Jan 7, 2010 at 3:45 PM, Christopher Barker
> <chris.bar...@noaa.gov> wrote:
>> Bruce Southey wrote:
>>>> <chris.bar...@noaa.gov> wrote:
>>
>>> Using the numpy NaN or similar (noting R's approach to missing values,
>>> which in turn allows it to have the above functionality) is just a
>>> very bad idea for missing values, because you always have to check
>>> which NaN is a missing value and which was due to some numerical
>>> calculation.
>>
>> well, this is specific to reading files, so you know where it came from.
>
> You can only know where it came from when you compare the original
> array to the transformed one. Also, a user has to check for missing
> values, or numpy has to warn the user that missing values are present
> immediately after reading the data, so the appropriate action can be
> taken (like using functions that handle missing values appropriately).
> That is my second problem with using codes (NaN, -99999, etc.) for
> missing values.
>
>> And the principle of fromfile() is that it is fast and simple; if you
>> want masked arrays, use slower, but more full-featured methods.
>
> So in that case it should fail with missing data.
>
>> However, in this case:
>>
>> In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
>> Out[9]: array([ 3.,  4., NaN,  5.])
>>
>> An actual NaN is read from the file, rather than a missing value.
>> Perhaps the user does want the distinction, so maybe it should really
>> only fill it in if the user asks for it, by specifying
>> "missing_value=np.nan" or something.
>
> Yes, that is my first problem with using predefined codes for missing
> values: you do not always know what is going to occur in the data.
>
>>> From what I can see, you expect that fromfile() should only
>>> split at the supplied delimiters, optionally(?) strip any whitespace
>>
>> whitespace stripping is not optional.
>>
>>> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>>> actually assumes multiple delimiters, because there is no comma between
>>> 4 and 5 or between 8 and 9.
>>
>> Yes, that's the point. I thought about allowing arbitrary multiple
>> delimiters, but I think '\n' is a special case -- for instance, a comma
>> at the end of some numbers might mean missing data, but a '\n' would not.
>>
>> And I couldn't really think of a useful use-case for arbitrary multiple
>> delimiters.
>>
>>> In Josef's last case, how many 'missing values' should there be?
>>
>> extra newlines at end of file:
>>
>> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>>
>> none -- exactly why I think \n is a special case.
>
> What about '\r' and '\n\r'?

Yes, I forgot about this, and it will be the most common case for Windows
users like myself. I think \r should be stripped automatically, like in
non-binary (text mode) reading of files in Python.
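Roughly what I have in mind, as a sketch done by hand on the string before
parsing (only an illustration of the behaviour I'd like -- the example string
and the replace calls are made up here, this is not what fromstring/fromfile
do today):

    import numpy as np

    s = '1, 2, 3, 4\r\n5, 6, 7, 8\r\n9, 10, 11, 12\r\n'  # Windows line endings

    # normalize '\r\n' and any stray '\r' to '\n' by hand, drop the trailing
    # newline, then treat '\n' as just another delimiter alongside the comma
    s = s.replace('\r\n', '\n').replace('\r', '\n').strip()
    a = np.fromstring(s.replace('\n', ','), sep=',')
    print(a.reshape(3, 4))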
>> What about extra newlines in the middle of the file?
>>
>> str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
>>
>> I think they should be ignored, but I hope I'm not making something that
>> is too specific to my personal needs.
>
> Not really, it is more that I am being somewhat difficult to ensure I
> understand what you actually need.
>
> My problem with this is that you are reading one huge 1-D array (that
> you can resize later) rather than a 2-D array with rows and columns
> (which is what I deal with). But I agree that you can have an option
> to treat '\n' or '\r' as a delimiter, although I think it should be
> turned off by default.
>
>> Travis Oliphant wrote:
>>> +1 (ignoring new-lines transparently is a nice feature). You can also
>>> use sscanf with weave to read most files.
>>
>> right -- but that requires weave. In fact, MATLAB has an fscanf function
>> that allows you to pass in a C format string, and it vectorizes it to use
>> the same one over and over again until it's done. It's actually quite
>> powerful and flexible. I once started with that in mind, but didn't have
>> the C chops to do it. I ended up with a tool that only did doubles (come
>> to think of it, MATLAB only does doubles, anyway...)
>>
>> I may some day write a whole new C (or, more likely, Cython) function
>> that does something like that, but for now, I'm just trying to get
>> fromfile to be useful for me.
>>
>>> +1 (much preferable to insert NaN or another user value than raise
>>> ValueError, in my opinion)
>>
>> But raise an error for integer types?
>>
>> I guess this is still up in the air -- no consensus yet.
>>
>> Thanks,
>>
>>    -Chris
>
> You should have a corresponding value for ints, because raising an
> exception would be inconsistent with allowing floats to have a value.

No, I think different nan/missing-value handling for integers and floats
is a natural distinction. There is no default nan code for integers, but
nan (and inf) are valid floating point numbers (even if nan is not a
number). And the default treatment of nans in numpy is getting pretty
good (e.g. I like the new (nan)sort).

> If you must keep the user-defined dtype then, as Josef suggests, just
> use some code, be it -999 or the most negative number supported by the
> OS for the defined dtype, or just convert the ints into floats if the
> user does not define a missing value code. It would be nice to either
> return the number of missing values or display a warning indicating
> how many occurred.

A warning would be good, but doing np.any(np.isnan(x)) or
np.isnan(x).sum() on the result is always a good idea for a user when
missing values are a possibility.

Josef

> Bruce
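P.S. For concreteness, this is the kind of after-the-fact check I mean; x
here just stands in for whatever fromfile/fromstring returned, using the
example string from earlier in the thread:

    import numpy as np

    # stand-in for the result of reading a file with a missing/NaN entry
    x = np.fromstring("3, 4, NaN, 5", sep=",")

    n_nan = np.isnan(x).sum()        # how many NaNs ended up in the data
    if np.any(np.isnan(x)):          # or just a yes/no answer
        print("%d NaN(s) found after reading" % n_nan)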