On Sun, Jan 19, 2014 at 7:21 AM, Oscar Benjamin <oscar.j.benja...@gmail.com> wrote:

> > long as numpy.loadtxt is explicitly documented as only working with
> > latin-1 encoded files (it currently isn't), there's no problem.
>
> Actually there is a problem. If it explicitly specified the encoding as
> latin-1 when opening the file then it could document the fact that it
> works for latin-1 encoded files. However it actually uses the system
> default encoding to read the file

which is a really bad default -- oh well. Also, I don't think it was a choice, at least not a well thought out one, but rather what fell out of trying to make it "just work" on py3.

> and then converts the strings to
> bytes with the as_bytes function that is hard-coded to use latin-1:
> https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
>
> So it only works if the system default encoding is latin-1 and the
> file content is white-space and newline compatible with latin-1.
> Regardless of whether the file itself is in utf-8 or latin-1 it will
> only work if the system default encoding is latin-1. I've never used a
> system that had latin-1 as the default encoding (unless you count
> cp1252 as latin-1).

even if it was a common default it would be a "bad idea". Fortunately (?), since it really is broken, we can fix it without being too constrained by backwards compatibility.

> > If it's supposed to work with other encodings (but the entire file is
> > still required to use a consistent encoding), then it just needs
> > encoding and errors arguments to fit the Python 3 text model (with
> > "latin-1" documented as the default encoding).
>
> This is the right solution. Have an encoding argument, document the
> fact that it will use the system default encoding if none is
> specified, and re-encode using the same encoding to fit any dtype='S'
> bytes column. This will then work for any encoding including the ones
> that aren't ASCII-compatible (e.g. utf-16).

Exactly, except I don't think the system encoding as a default is a good choice.

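To make the "explicit encoding, then re-encode with the same encoding" idea concrete, here is a rough sketch using only the standard library. It is not numpy's code: read_fields is a hypothetical helper made up for illustration, and the sample data is invented. It also shows why a hard-coded latin-1 conversion cannot handle text that was decoded with some other encoding.

```python
import io
import os
import tempfile


def read_fields(path, encoding):
    """Hypothetical helper (not part of numpy): read whitespace-delimited
    rows as text, then re-encode each field with the same encoding."""
    rows = []
    # Passing the encoding explicitly means the result does not depend
    # on the system default encoding.
    with io.open(path, "r", encoding=encoding) as f:
        for line in f:
            fields = line.split()
            if fields:
                # Re-encode with the same codec, as one would to fill
                # a bytes ('S') column.
                rows.append([field.encode(encoding) for field in fields])
    return rows


# Invented sample data: a utf-8 file with characters outside latin-1.
text = u"\u0141\u00f3d\u017a"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text + u" 1.5\n")
    rows = read_fields(path, "utf-8")
finally:
    os.remove(path)

# A conversion hard-coded to latin-1 fails on the same text.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

With the encoding passed in by the caller, the same code gives the same result on any machine, which is the whole argument against falling back to the system default.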
If there is a default MOST people will use it. And it will work for a lot of their test code. Then it will break if the code is passed to a system with a different default encoding, or a file comes from another source in a different encoding. This is very, very likely. Far more likely than files consistently being in the system encoding....

> > default behaviour, since passing something like
> > codecs.getdecoder("utf-8") as a column converter should do the right
> > thing.

that seems to work at the moment, actually, if done with care.

> That's just getting silly IMO. If the file uses mixed encodings then I
> don't consider it a valid "text file" and see no reason for loadtxt to
> support reading it.

agreed -- that's just getting crazy -- the only use-case I can imagine is to clean up a file that got mojibaked by some other process -- not really the use case for loadtxt and friends.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev