I'm no unicode expert, but can't we truncate unicode strings so that only valid characters are included?
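For what it's worth, here's a rough sketch of the kind of truncation I mean, assuming the bytes being stored are valid UTF-8 to begin with (the helper name and the byte limit are just made up for illustration):

    def truncate_utf8(data: bytes, limit: int) -> bytes:
        """Trim valid UTF-8 to at most `limit` bytes without splitting a character."""
        if len(data) <= limit:
            return data
        cut = limit
        # A UTF-8 continuation byte looks like 0b10xxxxxx.  If the first byte
        # we chopped off is one, the cut landed mid-character; back up to the
        # start of that character.
        while cut > 0 and (data[cut] & 0xC0) == 0x80:
            cut -= 1
        return data[:cut]

    print(truncate_utf8("naïve".encode("utf-8"), 4))  # b'na\xc3\xaf' -- the two-byte 'ï' still fits
    print(truncate_utf8("naïve".encode("utf-8"), 3))  # b'na' -- the half 'ï' is dropped, not split

You can lose a whole character at the cut, but at least what's left still decodes cleanly.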
On Thu, Apr 20, 2017 at 1:32 PM Chris Barker <chris.bar...@noaa.gov> wrote:

> On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <peridot.face...@gmail.com>
> wrote:
>
>> Is there any reason not to support all Unicode encodings that Python
>> does, with the same names and semantics? This would surely be the
>> simplest to understand.
>
> I think it should support all fixed-length encodings, but not the
> non-fixed-length ones -- they just don't fit well into the numpy data
> model.
>
>> Also, if latin1 is going to be the only practical 8-bit encoding, maybe
>> check with some non-Western users to make sure it's not going to wreck
>> their lives? I'd have selected ASCII as an encoding to treat specially,
>> if any, because Unicode already does that and the consequences are
>> familiar. (I'm used to writing and reading French without accents
>> because it's passed through ASCII, for example.)
>
> latin-1 (or latin-9) only makes things better than ASCII -- it buys most
> of the accented characters for the European languages and some symbols
> that are nice to have (I use the degree symbol a lot...). And it is
> ASCII-compatible -- so there is NO reason to choose ASCII over Latin-*.
>
> It does no good for non-Latin languages, though -- so we need to hear
> from the community -- is there a substantial demand for a non-Latin
> one-byte-per-character encoding?
>
>> Variable-length encodings, of which UTF-8 is obviously the one that
>> makes good handling essential, are indeed more complicated. But is it
>> strictly necessary that string arrays hold fixed-length *strings*, or
>> can the encoded length be fixed instead? That is, currently if you try
>> to assign a longer string than will fit, the string is truncated to the
>> number of characters in the data type.
>
> We could do that, yes, but an improperly truncated "string" becomes
> invalid -- it just seems like a recipe for bugs that won't be found in
> testing.
>
> Memory is cheap and compression is fast -- we really shouldn't get hung
> up on this!
>
> Note: if you are storing a LOT of text (and I have no idea why you would
> use numpy for that anyway), then the memory size might matter, but then
> semi-arbitrary truncation would probably matter, too.
>
> I expect most text storage in numpy arrays is things like names of
> datasets, ids, etc. -- not massive amounts of text -- so storage space
> really isn't critical. But having an id or something unexpectedly
> truncated could be bad.
>
> I think practical experience has shown us that people do not handle
> "mostly fixed-length but once in a while not" text well -- see the
> nightmare of UTF-16 on Windows. Granted, UTF-8 is multi-byte far more
> often than UTF-16, so errors are far more likely to be found in tests
> (why would you use UTF-8 if all your data are ASCII?). But still -- why
> invite hard-to-test-for errors?
>
> Final point -- as Julian suggests, one reason to support UTF-8 is
> interoperability with other systems -- but that makes errors more of an
> issue -- if the data doesn't pass through the numpy truncation machinery,
> invalid data could easily get put in a numpy array.
>
> -CHB
>
>> It would allow UTF-8 to be used just the way it usually is -- as an
>> encoding that's almost 8-bit.
>
> Ouch! That perception is the route to way too many errors! It is by no
> means almost 8-bit, unless your data are almost ASCII -- in which case,
> use latin-1 for pity's sake!
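As an aside, to make the byte counting concrete (this is just an illustration, not anyone's proposed API): latin-1 really is one byte per character for Western-European text, while UTF-8 grows as soon as you leave ASCII:

    s = "température 25°"             # accented character plus a degree sign
    print(len(s))                     # 15 characters
    print(len(s.encode("latin-1")))   # 15 bytes -- one byte per character
    print(len(s.encode("utf-8")))     # 17 bytes -- 'é' and '°' take two bytes each

A fixed number of latin-1 bytes is therefore also a fixed number of characters, whereas a fixed number of UTF-8 bytes holds a varying number of characters.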
> This highlights my point, though -- if we support UTF-8, people WILL use
> it, and only test it with mostly-ASCII text, and not find the bugs that
> will crop up later.
>
>> All this said, it seems to me that the important use cases for string
>> arrays involve interaction with existing binary formats, so people who
>> have to deal with such data should have the final say. (My own closest
>> approach to this is the FITS format, which is restricted by the standard
>> to ASCII.)
>
> Yup -- not sure we'll get much guidance here though -- netCDF does not
> solve this problem well, either.
>
> But if you are pulling, say, a UTF-8 encoded string out of a netCDF file,
> it's probably better to pull it out as bytes and pass it through the
> Python decoding/encoding machinery than to paste the bytes straight into
> a numpy array and hope that the encoding and truncation are correct.
>
> -CHB
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959  voice
> 7600 Sand Point Way NE   (206) 526-6329  fax
> Seattle, WA 98115        (206) 526-6317  main reception
>
> chris.bar...@noaa.gov
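That last suggestion seems like the safest route to me. A minimal sketch of the "pull it out as bytes and decode explicitly" idea (the byte strings and dtype sizes here are invented for the example):

    import numpy as np

    # Pretend these bytes came out of a file whose metadata says they are UTF-8.
    raw = np.array([b"Montr\xc3\xa9al", b"Z\xc3\xbcrich"], dtype="S10")

    # Decoding through Python surfaces bad or truncated bytes right away:
    # the default errors='strict' raises instead of silently storing garbage.
    decoded = np.array([item.decode("utf-8") for item in raw], dtype="U10")
    print(decoded)   # ['Montréal' 'Zürich']

Going the other way -- encode in Python, check the encoded length, then store the bytes -- fails early in the same way instead of silently corrupting the data.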