On 24 January 2014 01:09, Chris Barker <chris.bar...@noaa.gov> wrote:
> On Thu, Jan 23, 2014 at 4:02 PM, Oscar Benjamin <oscar.j.benja...@gmail.com>
> wrote:
>>
>> On 23 January 2014 21:51, Chris Barker <chris.bar...@noaa.gov> wrote:
>> >
>> > However, I would prefer latin-1 -- that way you might get garbage for
>> > the non-ascii parts, but it wouldn't raise an exception and it
>> > round-trips through encoding/decoding. And you would have a somewhat
>> > more useful subset -- including the latin-language characters and
>> > symbols like the degree symbol, etc.
>>
>> Exceptions and error messages are a good thing! Garbage is not!!! :)
>
> in principle, I agree with you, but sometimes practicality beats purity.
>
> in py2 there is a lot of implicit encoding/decoding going on, using the
> system encoding. That is ascii on a lot of systems. The result is that there
> is a lot of code out there that folks have ported to use unicode, but missed
> a few corners. If that code is only tested with ascii, it all seems to be
> working but then out in the wild someone puts another character in there and
> presto -- a crash.

Precisely. The Py3 text model uses TypeErrors to warn early against
this kind of thing. No longer do you have code that seems to work
until the wrong character goes in. You get the error straight away
when you try to mix bytes and text. You still have the option to
silence those errors; it just needs to be done explicitly:

    >>> s = 'Õscar'
    >>> s.encode('ascii', errors='replace')
    b'?scar'

> Also, there are places where the inability to encode drops the message
> silently -- for instance if an Exception is raised with a unicode message,
> it will get silently dropped when it comes time to display on the terminal.
> I spent quite a while banging my head against that one recently when I
> tried to update some code to read unicode files. I would have been MUCH
> happier with a bit of garbage in the message than having it drop (or raise
> an encoding error in the middle of the error...)

Yeah, that's just a bug in CPython. I think it's fixed now but either
way you're right: for the particular case of displaying error messages
the interpreter should do whatever it takes to get some kind of error
message out even if it's a bit garbled. I disagree that this should be
the basis for ordinary data processing with numpy though.

> I think this is a bad thing.
>
> The advantage of latin-1 is that while you might get something that doesn't
> print right, it won't crash, and it won't contaminate the data, so
> comparisons, etc, will still work. kind of like using utf-8 in an old-style
> C char array -- you can still pass it around and compare it, even if the
> bytes don't mean what you think they do.

It round-trips okay as long as you don't try to do anything else with
the string. So does the textarray class I proposed in a new thread: if
you just use fromfile and tofile it works fine for any input (except
for trailing nulls), but if you try to decode invalid bytes it will
throw errors. It wouldn't be hard to add configurable error handling
there either.
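
As a rough illustration of the trade-off discussed above, here is a minimal Python 3 sketch. It uses only built-in str/bytes methods, not numpy or the proposed textarray class, and the byte string `raw` is an invented sample rather than data from this thread.

```python
# Illustrative sketch only (Python 3). `raw` is a made-up sample:
# 0xb0 is the degree sign in latin-1 but is not valid ASCII.
raw = b"25\xb0C"

# 1. Mixing bytes and text fails straight away with a TypeError.
try:
    "temp: " + raw
except TypeError as exc:
    mixing_error = type(exc).__name__

# 2. latin-1 maps every possible byte to a code point, so decoding
#    never raises and encoding back returns the original bytes.
text = raw.decode("latin-1")
round_trips = text.encode("latin-1") == raw

# 3. A strict ASCII decode raises instead of producing garbage...
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    strict_error = type(exc).__name__

# ...unless lossy handling is requested explicitly.
replaced = raw.decode("ascii", errors="replace")
encoded = "\u00d5scar".encode("ascii", errors="replace")  # 'Õscar'

print(mixing_error, round_trips, strict_error, repr(replaced), encoded)
```

The latin-1 path never fails, but it only preserves the bytes; whether the resulting text is meaningful still depends on the data really being latin-1. The strict path reports the problem at the point of decoding, and the `errors=` argument is where a lossy fallback can be chosen on purpose.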

Oscar