On Mon, Apr 24, 2017 at 4:23 PM, Robert Kern <robert.k...@gmail.com> wrote:
> My question: What are those non-ASCII characters? How often are they
> truly latin-1/9 vs. some other text encoding vs. non-string binary data?
>
> I don't know that we can reasonably make that accounting relevant. Number
> of such characters per byte of text? Number of files with such characters
> out of all existing files?

I have a lot of mostly-English text -- usually not strictly latin-1, but
usually mostly latin-1. The non-ASCII characters are a handful of accented
characters (usually from Spanish, some French), plus a few "scientific"
characters: the degree symbol, the "micro" symbol. I suspect that this is
not an unusual pattern for mostly-English scientific text.

If it's non-string binary data, I know it -- and I'd use a bytes type.

So I have two options: try to detect the encoding properly, or use
_something_ and fix it up later. latin-1 is a great choice for the latter
option -- most of the text displays fine, and the wrong stuff is left
untouched, so I can figure it out later (quick sketch in the P.S. below).

> What I can say with assurance is that every time I have decided, as a
> developer, to write code that just hardcodes latin-1 for such cases, I
> have regretted it. While it's just personal anecdote, I think it's at
> least measuring the right thing. :-)

I've had the opposite experience -- so that's two anecdotes :-)

If the data were, say, Shift-JIS, then yes, using latin-1 would be a bad
idea -- but not really much worse than any other option short of properly
decoding it.

In a way, using latin-1 is like the old py2 string -- it can be used as
text, even if it has arbitrary non-text garbage in it...

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
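P.S. A quick sketch of what I mean by "use latin-1 and fix it up later" --
the byte string below is just made up for illustration, but this behavior
is why latin-1 works as a lossless fallback:

    # latin-1 maps every byte 0x00-0xFF to a code point, so decoding
    # arbitrary data never raises and never alters the underlying bytes.
    raw = b"25\xb0C and 3 \xb5m -- plus some junk: \xff\xfe"

    text = raw.decode("latin-1")          # always succeeds
    assert text.encode("latin-1") == raw  # round-trips losslessly

    # If the real encoding later turns out to be something else (say
    # utf-8), the original bytes are still there to re-decode properly:
    fixed = text.encode("latin-1").decode("utf-8", errors="replace")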