On Sun, Jan 19, 2014 at 7:21 AM, Oscar Benjamin
<oscar.j.benja...@gmail.com> wrote:

> > long as numpy.loadtxt is explicitly documented as only working with
> > latin-1 encoded files (it currently isn't), there's no problem.
>
> Actually there is a problem. If it explicitly specified the encoding as
> latin-1 when opening the file then it could document the fact that it
> works for latin-1 encoded files. However it actually uses the system
> default encoding to read the file


which is a really bad default -- oh well. Also, I don't think it was a
choice, at least not a well thought out one, but rather what fell out of
trying to make it "just work" on py3.

> and then converts the strings to
> bytes with the as_bytes function that is hard-coded to use latin-1:
> https://github.com/numpy/numpy/blob/master/numpy/compat/py3k.py#L28
>
> So it only works if the system default encoding is latin-1 and the
> file content is white-space and newline compatible with latin-1.
> Regardless of whether the file itself is in utf-8 or latin-1 it will
> only work if the system default encoding is latin-1. I've never used a
> system that had latin-1 as the default encoding (unless you count
> cp1252 as latin-1).
>

Even if it were a common default it would be a "bad idea". Fortunately (?),
since it really is broken, we can fix it without being too constrained by
backwards compatibility.
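
To make the failure mode concrete, here is a rough standalone sketch in plain
Python (no numpy). It only imitates the two-step path described above: decode
the file with some encoding, then re-encode with a hard-coded latin-1. The
helper name read_then_reencode is made up for illustration and is not numpy
code.

```python
# Bytes of a UTF-8 encoded line; "\u00e9" is e-acute, which exists in latin-1.
raw = "caf\u00e9 1.5\n".encode("utf-8")


def read_then_reencode(raw_bytes, read_encoding):
    """Hypothetical imitation: decode with one codec, re-encode as latin-1."""
    text = raw_bytes.decode(read_encoding)
    return text.encode("latin-1")


# latin-1 maps every byte to a code point and back, so the bytes survive
# the round trip even though the file is really UTF-8.
survives_with_latin1 = read_then_reencode(raw, "latin-1") == raw

# Decoding as UTF-8 yields the right text, but the latin-1 re-encoding
# produces different bytes than the ones in the file.
survives_with_utf8 = read_then_reencode(raw, "utf-8") == raw

# A character outside latin-1 (the euro sign, U+20AC) fails outright.
euro_failed = False
try:
    read_then_reencode("\u20ac 1.5\n".encode("utf-8"), "utf-8")
except UnicodeEncodeError:
    euro_failed = True
```

So in this sketch the bytes only come through unchanged when the decoding step
also uses latin-1, whatever encoding the file itself is in.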

>
> > If it's supposed to work with other encodings (but the entire file is
> > still required to use a consistent encoding), then it just needs
> > encoding and errors arguments to fit the Python 3 text model (with
> > "latin-1" documented as the default encoding).
>
> This is the right solution. Have an encoding argument, document the
> fact that it will use the system default encoding if none is
> specified, and re-encode using the same encoding to fit any dtype='S'
> bytes column. This will then work for any encoding including the ones
> that aren't ASCII-compatible (e.g. utf-16).
>

Exactly, except I don't think the system encoding as a default is a good
choice. If there is a default, MOST people will use it. And it will work for
a lot of their test code. Then it will break if the code is passed to a
system with a different default encoding, or a file comes from another
source in a different encoding. This is very, very likely. Far
more likely than files consistently being in the system encoding.
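
A minimal sketch of what "explicit encoding, same encoding for the bytes
column" could look like, in plain Python rather than numpy. The function
read_first_column and its keyword-only encoding parameter are hypothetical;
making the argument required (no default) is just one way to avoid depending
on the system encoding.

```python
import os
import tempfile


def read_first_column(path, *, encoding):
    """Hypothetical helper: decode the file with an explicit codec, then
    re-encode each first-column field with that same codec, roughly what
    a dtype='S' column would need."""
    with open(path, encoding=encoding) as f:
        return [line.split()[0].encode(encoding) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    # utf-16 is not ASCII-compatible, so this exercises the harder case.
    with open(path, "w", encoding="utf-16") as f:
        f.write("caf\u00e9 1.5\nth\u00e9 2.5\n")
    fields = read_first_column(path, encoding="utf-16")

# Each field is bytes; decoding with the same codec recovers the text.
# (The "utf-16" codec prepends a BOM to every encoded field.)
decoded = [field.decode("utf-16") for field in fields]
```

Because the caller has to name the encoding, the same script behaves the same
way on a machine with a different locale.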


> > default behaviour, since passing something like
> > codecs.getdecoder("utf-8") as a column converter should do the right
> > thing.
>

that seems to work at the moment, actually, if done with care.
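
One piece of the "care", shown in plain Python only (how loadtxt hands
fields to a converter is not shown here, and I am assuming it passes bytes):
the function returned by codecs.getdecoder gives back a (text, length) tuple,
so a converter would need to pick out the first element. The wrapper name
utf8_converter is made up for illustration.

```python
import codecs

decode_utf8 = codecs.getdecoder("utf-8")

# The decoder returns a (text, bytes_consumed) tuple, not a bare string.
result = decode_utf8(b"caf\xc3\xa9")


def utf8_converter(field):
    """Hypothetical per-column converter: keep only the decoded text."""
    return decode_utf8(field)[0]


text = utf8_converter(b"caf\xc3\xa9")
```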

> That's just getting silly IMO. If the file uses mixed encodings then I
> don't consider it a valid "text file" and see no reason for loadtxt to
> support reading it.


agreed -- that's just getting crazy -- the only use-case I can imagine is to
clean up a file that got moji-baked by some other process -- not really the
use case for loadtxt and friends.

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev