Re: [Scons-dev] Merge PR #235 before release

Bill Deegan Thu, 28 May 2015 06:51:04 -0700

Sounds like we should use sys.getfilesystemencoding().
+1

The only trick might be if you had several filesystems, each with a
different encoding: sys.getfilesystemencoding() reports only one value.
May just be best to back out the patch for now.
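
For reference, a minimal sketch of what using sys.getfilesystemencoding()
could look like. The helper name decode_filename is hypothetical (it is not
from the patch or from SCons), and the utf-8 fallback is an assumption for
platforms where no encoding is reported:

```python
import sys


def decode_filename(name, encoding=None):
    """Return `name` as text, decoding byte strings with the filesystem encoding.

    Hypothetical helper for illustration only; not part of SCons.
    """
    if encoding is None:
        # Assumption: fall back to utf-8 if the platform reports no encoding.
        encoding = sys.getfilesystemencoding() or "utf-8"
    if isinstance(name, bytes):
        # "replace" substitutes U+FFFD for undecodable bytes instead of raising.
        return name.decode(encoding, "replace")
    # Already text: nothing to decode.
    return name
```

Note that this still uses a single encoding for every path, so it does not
address the several-filesystems case.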


-Bill

On Thu, May 28, 2015 at 3:54 AM, Gary Oberbrunner <[email protected]>
wrote:

> If you're interested in this problem, I suggest reading
> https://docs.python.org/2/howto/unicode.html which has all the details
> (including how to ignore decode errors), and of course check out the
> python3 branch of scons where a lot of unicode handling has been done (but
> much is still left to do iirc).  I don't think pretending strings are in
> the cp437 encoding is a particularly good plan. ISO 8859-1 or Windows
> CP1252 would probably give better results in some cases but you still need
> to ignore errors in the decode.  And of course if the string actually is
> utf-8 with non-ascii chars, either of these encodings will return a string
> of the wrong length, not just wrong characters; and re-encoding it for
> output or storage will completely mangle it.
>
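
To make the wrong-length point concrete, here is a small sketch. The sample
string is an arbitrary choice for illustration; the rest is plain codec
behaviour:

```python
text = u"na\u00efve"              # 5 characters
raw = text.encode("utf-8")        # 6 bytes: the i-diaeresis takes two bytes

# Pretending the bytes are latin-1 never raises, but yields one character
# per byte, so the string has the wrong length and the wrong characters.
wrong = raw.decode("latin-1")

# Re-encoding the mis-decoded text for output or storage makes it worse.
mangled = wrong.encode("utf-8")

# CP1252 leaves a few byte values undefined (0x81 among them), so decoding
# with it still needs an error policy such as "replace" or "ignore".
cp1252 = b"\x81".decode("cp1252", "replace")

print(len(text), len(raw), len(wrong), len(mangled))
```
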
> Of course we _can_ know the encoding of the filenames in the filesystem,
> that's what sys.getfilesystemencoding() is for (see the unicode link
> above). Reading file contents and handling stdout/stderr from SCons
> subprocesses is much more of a challenge.
>
>
> On Thu, May 28, 2015 at 3:28 AM, anatoly techtonik <[email protected]>
> wrote:
>
>> I found a way to convert any binary string to Unicode without crashing:
>> http://stackoverflow.com/a/27527728/239247 That correctly converts all
>> ASCII characters (and will probably make it possible to use ANSI graphics
>> if the Unicode font supports them), but it will not work for non-ASCII
>> UTF-8 characters.
>>
>> Python 3 adds the surrogateescape error handler, but that is not present
>> in Python 2.
>>
>> http://stackoverflow.com/questions/19649463/how-to-do-surrogateescape-in-python2
>> I don't know why they called it "surrogate" - it is a freaky word.
>>
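
For context, a short sketch of the Python 3 surrogateescape round trip
(Python 3 only; the sample bytes are made up for illustration). Each
undecodable byte 0xNN is mapped to the lone surrogate code point U+DCNN,
which is where the name comes from:

```python
raw = b"caf\xe9 \xff"   # latin-1 style bytes, not valid UTF-8

# Undecodable bytes are smuggled through as lone surrogates U+DC80..U+DCFF.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")

# Without the handler, the lone surrogates cannot be encoded at all.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The decoded text is only safe to pass around and convert back; printing or
storing it with a strict encoder fails, as the last step shows.
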
>> On Wed, May 27, 2015 at 4:33 PM, Kenny, Jason L <[email protected]>
>> wrote:
>> > I would agree with this.
>> >
>> > In general, operating systems today store file data (i.e. the
>> > filesystem data, not the data in the file) in Unicode (be it UTF-16 or
>> > UTF-8). On Linux this is not always the case: it could be Big5 or some
>> > other locale encoding. On Linux there are means to find out what the
>> > “native” encoding is and use it.
>> >
>> > I should note that the idea of converting binary to Unicode does not
>> > really exist. The point of a binary string is to hold arbitrary data
>> > (e.g. a double in its raw 64-bit form rather than the decimal value
>> > 1.2385). One can assume that it is in a certain code page encoding and
>> > convert from that. And as I stated above, there are APIs to find out
>> > what the locale's code page encoding is, and that can be used to convert
>> > the text to the local ANSI/OEM encoding. This is different from a binary
>> > string.
>> >
>> > Jason
>> >
>> > From: Scons-dev [mailto:[email protected]] On Behalf Of Gary
>> > Oberbrunner
>> > Sent: Wednesday, May 27, 2015 7:43 AM
>> > To: SCons developer list
>> > Subject: Re: [Scons-dev] Merge PR #235 before release
>> >
>> > On Wed, May 27, 2015 at 6:52 AM, anatoly techtonik <[email protected]>
>> > wrote:
>> >
>> > What I need is a bulletproof way to convert from anything to unicode.
>> > This requires some kind of escaping to go forward and back. Some helper
>> > methods like u2b() (unicode to binary) and b2u(). I am quite surprised
>> > that so far I found nothing for this "simple" case.
>> >
>> >
>> > That's because in general the encoding of the "binary" string is
>> > unknown. Is it ascii, utf-8, Windows CP-1252, shift-JIS, or something
>> > else?  You can't decode such a string to Unicode without knowing the
>> > encoding. Check out the python-3 branch where we've been working through
>> > some of those issues.  Your u2b is "easy" if you assume you want the
>> > binary to be utf-8 encoded, which is normally safe; this conversion is
>> > guaranteed to work. Your b2u is not so easy.  You can't just assume utf-8
>> > as you might think; if the string has invalid utf-8 bytes it'll raise an
>> > error or generate dummy chars depending on the args you pass to
>> > str.decode().  At least it'll get mangled if it's in a different encoding
>> > than you expect.
>> >
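
A rough sketch of the u2b()/b2u() pair discussed above. The names come from
the thread, but the signatures and defaults here are assumptions made for
illustration, not an existing SCons API:

```python
def u2b(text):
    """Unicode to binary: always encode as utf-8 (the "easy" direction)."""
    return text.encode("utf-8")


def b2u(data, encoding="utf-8", errors="strict"):
    """Binary to unicode: the caller has to supply (or guess) the encoding."""
    return data.decode(encoding, errors)


# Round trip works when the bytes really are utf-8.
assert b2u(u2b(u"\u00e9t\u00e9")) == u"\u00e9t\u00e9"

# Invalid utf-8 raises with the default error policy...
try:
    b2u(b"\xff")
except UnicodeDecodeError as exc:
    print("decode failed:", exc)

# ...or produces a dummy character (U+FFFD) with errors="replace".
print(repr(b2u(b"\xff", errors="replace")))
```

The second argument of b2u() is where the real problem lives: nothing in the
byte string itself says which encoding to pass.
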
>> >
>> >
>> > --
>> >
>> > Gary
>> >
>> >
>> > _______________________________________________
>> > Scons-dev mailing list
>> > [email protected]
>> > https://pairlist2.pair.net/mailman/listinfo/scons-dev
>> >
>>
>>
>>
>> --
>> anatoly t.
>>
>
>
>
> --
> Gary
>
>
>

_______________________________________________
Scons-dev mailing list
[email protected]
https://pairlist2.pair.net/mailman/listinfo/scons-dev
