If you're interested in this problem, I suggest reading https://docs.python.org/2/howto/unicode.html, which has all the details (including how to ignore decode errors), and of course check out the python3 branch of scons, where a lot of unicode handling has been done (but much is still left to do iirc). I don't think pretending strings are in the cp437 encoding is a particularly good plan. ISO 8859-1 or Windows CP1252 would probably give better results in some cases, but you still need to ignore errors in the decode. And of course if the string actually is utf-8 with non-ascii chars, decoding it with either of these encodings will return a string of the wrong length, not just wrong characters, and re-encoding it for output or storage will completely mangle it.
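
To make the wrong-length and mangling point concrete, here is a small sketch. It is only an illustration, not code from SCons: the sample byte strings and variable names are made up, and it assumes the bytes really are utf-8 in the first half and Latin-1 in the second.

```python
# Illustration only: the sample bytes below are invented for this example.

# "cafe" with an accented e, encoded as utf-8: 5 bytes for 4 characters.
raw = b"caf\xc3\xa9"

right = raw.decode("utf-8")    # 4 characters, as intended
wrong = raw.decode("latin-1")  # 5 characters: each byte of the 2-byte
                               # sequence turns into its own character

# Re-encoding the mis-decoded text for output or storage changes the data:
# the original 5 bytes become 7.
mangled = wrong.encode("utf-8")

# The reverse mistake: Latin-1 bytes that are not valid utf-8.
legacy = b"caf\xe9"
ignored = legacy.decode("utf-8", "ignore")    # drops the bad byte
replaced = legacy.decode("utf-8", "replace")  # substitutes U+FFFD

# CP1252 leaves a few byte values undefined, so a strict decode can fail.
try:
    b"\x81".decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

print(len(right), len(wrong), len(raw), len(mangled))
```

The round trip through the wrong encoding does not raise anything, which is what makes it dangerous: the damage only shows up later, in the stored or printed data.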
Of course we _can_ know the encoding of the filenames in the filesystem, that's what sys.getfilesystemencoding() is for (see the unicode link above). Reading file contents and handling stdout/stderr from SCons subprocesses is much more of a challenge.

On Thu, May 28, 2015 at 3:28 AM, anatoly techtonik <[email protected]> wrote:

> I found a way to convert any binary string to Unicode without crashing -
> http://stackoverflow.com/a/27527728/239247 That would correctly
> convert all `ascii` characters (and will probably make it possible to use
> ANSI graphics if unicode font supports that), but it will not work for
> other utf-8 characters.
>
> Python 3 adds some surrogateescape, but that is not present in Python 2.
> http://stackoverflow.com/questions/19649463/how-to-do-surrogateescape-in-python2
> I don't know why they called it "surrogate" - it is a freaky word.
>
> On Wed, May 27, 2015 at 4:33 PM, Kenny, Jason L <[email protected]> wrote:
> > I would agree with this.
> >
> > In general the OS today store file data ( ie the file system data not the
> > data in the file) in Unicode ( be it utf-16 or utf-8). On Linux this is not
> > always the case it could be big5 or some other locale encoding. On Linux
> > there are means to see what the “native” encoding is to use it.
> >
> > I should note that the idea of converting binary to Unicode does not really
> > exist. The point of a binary string to is to hold random data ( ie like a
> > double in the raw form 64-bit vs the dec values of 1.2385). One can assume
> > that it is a certain code page encoding and convert from that. And like I
> > stated above there are api to see what the locale code page encoding is and
> > that can be used to convert the code to the local ANSI/OEM encoding. This is
> > different from a binary string.
> >
> > Jason
> >
> > From: Scons-dev [mailto:[email protected]] On Behalf Of Gary Oberbrunner
> > Sent: Wednesday, May 27, 2015 7:43 AM
> > To: SCons developer list
> > Subject: Re: [Scons-dev] Merge PR #235 before release
> >
> > On Wed, May 27, 2015 at 6:52 AM, anatoly techtonik <[email protected]> wrote:
> >
> > What I need is a bulletproof way to convert from anything to unicode. This
> > requires some kind of escaping to go forward and back. Some helper
> > methods like u2b() (unicode to binary) and b2u(). I am quite surprised that
> > so far I found nothing for this "simple" case.
> >
> > That's because in general the encoding of the "binary" string is unknown.
> > Is it ascii, utf-8, Windows CP-1252, shift-JIS, or something else? You
> > can't decode such a string to Unicode without knowing the encoding. Check
> > out the python-3 branch where we've been working through some of those
> > issues. Your u2b is "easy" if you assume you want the binary to be utf-8
> > encoded, which is normally safe; this conversion is guaranteed to work.
> > Your b2u is not so easy. You can't just assume utf-8 as you might think; if
> > the string has invalid utf-8 bytes it'll raise an error or generate dummy
> > chars depending on the args you pass to str.decode(). At least it'll get
> > mangled if it's in a different encoding than you expect.
> >
> > --
> > Gary
>
> --
> anatoly t.

--
Gary
_______________________________________________
Scons-dev mailing list
[email protected]
https://pairlist2.pair.net/mailman/listinfo/scons-dev
