Sounds like we should use sys.getfilesystemencoding(). +1

The only catch might be if you had several filesystems, each with a different filesystem encoding... It may be best to just back out the patch for now.
-Bill

On Thu, May 28, 2015 at 3:54 AM, Gary Oberbrunner <[email protected]> wrote:

> If you're interested in this problem, I suggest reading
> https://docs.python.org/2/howto/unicode.html which has all the details
> (including how to ignore decode errors), and of course check out the
> python3 branch of scons where a lot of unicode handling has been done (but
> much is still left to do iirc). I don't think pretending strings are in
> the cp437 encoding is a particularly good plan. ISO 8859-1 or Windows
> CP1252 would probably give better results in some cases but you still need
> to ignore errors in the decode. And of course if the string actually is
> utf-8 with non-ascii chars, either of these encodings will return a string
> of the wrong length, not just wrong characters; and re-encoding it for
> output or storage will completely mangle it.
>
> Of course we _can_ know the encoding of the filenames in the filesystem,
> that's what sys.getfilesystemencoding() is for (see the unicode link
> above). Reading file contents and handling stdout/stderr from SCons
> subprocesses is much more of a challenge.
>
>
> On Thu, May 28, 2015 at 3:28 AM, anatoly techtonik <[email protected]>
> wrote:
>
>> I found a way to convert any binary string to Unicode without crashing -
>> http://stackoverflow.com/a/27527728/239247 That would correctly
>> convert all `ascii` characters (and will probably make it possible to use
>> ANSI graphics if unicode font supports that), but it will not work for
>> other utf-8 characters.
>>
>> Python 3 adds some surrogateescape, but that is not present in Python 2.
>> http://stackoverflow.com/questions/19649463/how-to-do-surrogateescape-in-python2
>> I don't know why they called it "surrogate" - it is a freaky word.
>>
>> On Wed, May 27, 2015 at 4:33 PM, Kenny, Jason L <[email protected]>
>> wrote:
>> > I would agree with this.
>> >
>> > In general the OS today stores file data (i.e. the file system data,
>> > not the data in the file) in Unicode (be it utf-16 or utf-8). On Linux
>> > this is not always the case; it could be big5 or some other locale
>> > encoding. On Linux there are means to see what the “native” encoding
>> > is and to use it.
>> >
>> > I should note that the idea of converting binary to Unicode does not
>> > really exist. The point of a binary string is to hold random data
>> > (i.e. like a double in the raw 64-bit form vs. the decimal value
>> > 1.2385). One can assume that it is a certain code page encoding and
>> > convert from that. And like I stated above, there are APIs to see what
>> > the locale code page encoding is, and that can be used to convert the
>> > code to the local ANSI/OEM encoding. This is different from a binary
>> > string.
>> >
>> > Jason
>> >
>> >
>> > From: Scons-dev [mailto:[email protected]] On Behalf Of Gary
>> > Oberbrunner
>> > Sent: Wednesday, May 27, 2015 7:43 AM
>> > To: SCons developer list
>> > Subject: Re: [Scons-dev] Merge PR #235 before release
>> >
>> >
>> > On Wed, May 27, 2015 at 6:52 AM, anatoly techtonik <[email protected]>
>> > wrote:
>> >
>> > What I need is a bulletproof way to convert from anything to unicode.
>> > This requires some kind of escaping to go forward and back. Some helper
>> > methods like u2b() (unicode to binary) and b2u(). I am quite surprised
>> > that so far I found nothing for this "simple" case.
>> >
>> >
>> > That's because in general the encoding of the "binary" string is
>> > unknown. Is it ascii, utf-8, Windows CP-1252, shift-JIS, or something
>> > else? You can't decode such a string to Unicode without knowing the
>> > encoding. Check out the python-3 branch where we've been working
>> > through some of those issues.
>> >
>> > Your u2b is "easy" if you assume you want the binary to be utf-8
>> > encoded, which is normally safe; this conversion is guaranteed to work.
>> > Your b2u is not so easy. You can't just assume utf-8 as you might
>> > think; if the string has invalid utf-8 bytes it'll raise an error or
>> > generate dummy chars depending on the args you pass to str.decode(). At
>> > least it'll get mangled if it's in a different encoding than you
>> > expect.
>> >
>> > --
>> > Gary
>> >
>> > _______________________________________________
>> > Scons-dev mailing list
>> > [email protected]
>> > https://pairlist2.pair.net/mailman/listinfo/scons-dev
>>
>> --
>> anatoly t.
>
> --
> Gary
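
For illustration, the b2u()/u2b() idea discussed in the thread could look roughly like the sketch below. This is only a sketch under assumptions, not code from SCons or from the patch: the helper names come from the thread but their signatures are made up here, it is Python 3 only (as noted above, surrogateescape is not available in Python 2), and defaulting to sys.getfilesystemencoding() only makes sense when the bytes are filenames. The third function, latin1_mangle(), is a hypothetical name used to show the "wrong length" problem Gary describes.

```python
import sys


def b2u(data, encoding=None, errors="surrogateescape"):
    """Decode bytes to str without raising on undecodable bytes.

    Hypothetical helper (name borrowed from the thread). Python 3 only:
    the "surrogateescape" error handler does not exist in Python 2.
    """
    if encoding is None:
        # Assumption: the bytes are a filename from the local filesystem.
        encoding = sys.getfilesystemencoding()
    return data.decode(encoding, errors)


def u2b(text, encoding=None, errors="surrogateescape"):
    """Encode str back to bytes; the inverse of b2u() for the same encoding."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return text.encode(encoding, errors)


def latin1_mangle(data):
    """Mis-decode bytes as ISO 8859-1, then re-encode the result as UTF-8.

    Returns (length of the mis-decoded string, re-encoded bytes) so the
    damage can be compared with the original input.
    """
    wrong = data.decode("latin-1")  # one character per input byte
    return len(wrong), wrong.encode("utf-8")
```

With an explicit encoding, u2b(b2u(raw, "utf-8"), "utf-8") gives back raw even when raw contains bytes that are not valid utf-8, because each bad byte is carried through as a lone surrogate code point. Such strings still cannot be printed or written with a strict encoder, so this only helps for passing data through, not for display.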
