I found a way to convert any binary string to Unicode without crashing:
http://stackoverflow.com/a/27527728/239247
It converts all `ascii` characters correctly (and will probably make it
possible to use ANSI graphics if the Unicode font supports them), but it will
not work for other utf-8 characters.
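
To make that trade-off concrete, here is a minimal Python 3 sketch. It is not necessarily what the linked answer does; it assumes the bytes are decoded as latin-1, which is one well-known way to get the behaviour described above (never crashes, ASCII intact, other utf-8 characters not recovered). The names b2u() and u2b() are hypothetical helpers, not existing SCons functions.

```python
# Sketch only: assumes latin-1 as the "never fails" codec.
# b2u/u2b are hypothetical helper names, not part of SCons.

def b2u(data: bytes) -> str:
    """Decode bytes to str without raising: byte N becomes code point N."""
    return data.decode("latin-1")


def u2b(text: str) -> bytes:
    """Inverse of b2u; only valid for strings that b2u produced."""
    return text.encode("latin-1")


raw = bytes(range(256))              # every possible byte value
assert u2b(b2u(raw)) == raw          # lossless round trip
assert b2u(b"scons") == "scons"      # ASCII comes through unchanged

euro_utf8 = "\u20ac".encode("utf-8")  # the euro sign as three utf-8 bytes
assert len(b2u(euro_utf8)) == 3       # decoded as three separate characters
assert b2u(euro_utf8) != "\u20ac"     # so the original character is not recovered
```

The round trip is lossless, so nothing is destroyed, but any non-ASCII utf-8 text is displayed as several unrelated characters until it is encoded back to bytes.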
Python 3 adds the surrogateescape error handler, but that is not present in
Python 2:
http://stackoverflow.com/questions/19649463/how-to-do-surrogateescape-in-python2
I don't know why they called it "surrogate" - it is a freaky word.

On Wed, May 27, 2015 at 4:33 PM, Kenny, Jason L <[email protected]> wrote:
> I would agree with this.
>
> In general the OS today store file data ( ie the file system data not the
> data in the file) in Unicode ( be it utf-16 or utf-8). On Linux this is not
> always the case it could be big5 or some other locale encoding. On Linux
> there are means to see what the “native” encoding is to use it.
>
> I should note that the idea of converting binary to Unicode does not really
> exist. The point of a binary string is to hold random data ( ie like a
> double in the raw form 64-bit vs the dec values of 1.2385). One can assume
> that it is a certain code page encoding and convert from that. And like I
> stated above there are api to see what the locale code page encoding is and
> that can be used to convert the code to the local ANSI/OEM encoding. This is
> different from a binary string.
>
> Jason
>
> From: Scons-dev [mailto:[email protected]] On Behalf Of Gary
> Oberbrunner
> Sent: Wednesday, May 27, 2015 7:43 AM
> To: SCons developer list
> Subject: Re: [Scons-dev] Merge PR #235 before release
>
> On Wed, May 27, 2015 at 6:52 AM, anatoly techtonik <[email protected]>
> wrote:
>
> What I need is a bulletproof way to convert from anything to unicode. This
> requires some kind of escaping to go forward and back. Some helper
> methods like u2b() (unicode to binary) and b2u(). I am quite surprised that
> so far I found nothing for this "simple" case.
>
> That's because in general the encoding of the "binary" string is unknown.
> Is it ascii, utf-8, Windows CP-1252, shift-JIS, or something else? You
> can't decode such a string to Unicode without knowing the encoding. Check
> out the python-3 branch where we've been working through some of those
> issues. Your u2b is "easy" if you assume you want the binary to be utf-8
> encoded, which is normally safe; this conversion is guaranteed to work.
> Your b2u is not so easy. You can't just assume utf-8 as you might think; if
> the string has invalid utf-8 bytes it'll raise an error or generate dummy
> chars depending on the args you pass to str.decode(). At least it'll get
> mangled if it's in a different encoding than you expect.
>
> --
> Gary
>
> _______________________________________________
> Scons-dev mailing list
> [email protected]
> https://pairlist2.pair.net/mailman/listinfo/scons-dev

--
anatoly t.
