On Wed, Sep 11, 2013 at 2:22 PM, Yunzhong Gao <[email protected]> wrote:

> Arthur wrote:
> > If #include directives will use UTF-8, then __FILE__ must also use UTF-8, so
> > that this will work:
> >
> >     #include __FILE__
> >
> > And I would expect #line directives also to use UTF-8. The only good rationale
> > I can imagine is that you're dealing with badly behaved third-party generators
> > such as lex/yacc which dump malformed #line directives into the source file.
> >
> > The patch looks good to me, but the stated rationale is misleading; I don't
> > think this patch helps with anything on a well-behaved system (even one
> > where the filesystem charset is Shift-JIS). It merely helps Clang not-barf on
> > malformed input (such as that produced by a badly behaved lex/yacc).
>
> For some reason, your replies just won't appear in Phabricator while Eli's went
> in just fine. Weird.

Phabricator requires you to sign in with your Facebook account, which I don't
particularly want to do, so all my replies are sent as email messages instead
of Phabricator comments.

> I think, a UTF-8 encoded source file should not contain shift-jis encoded
> lines like this:
>     #include "こんにちは.c"

That's UTF-8! :D  But I take it you mean that the user's source file should
not look like this in "od -t x1":

    0000000 23 69 6e 63 6c 75 64 65 20 22 82 b1 82 f1 82 c9
    0000020 82 bf 82 cd 2e 63 22 0a
    0000030

Instead, it should look like this:

    0000000 23 69 6e 63 6c 75 64 65 20 22 e3 81 93 e3 82 93
    0000020 e3 81 ab e3 81 a1 e3 81 af 2e 63 22 0a
    0000035

(That's the same Japanese text, simply encoded in UTF-8 instead of Shift-JIS.
We've already agreed that Clang expects all #include directives to consist of
UTF-8-encoded text.)

> But it is okay to have lines like this:
>     #include "\202\261\202\361\202\311\202\277\202\315.c"

It's *okay* to have that line, but it doesn't mean what you think it means.
First, the backslashes are problematic (at least according to the C++
standard); I don't actually know off the top of my head whether this would try
to open the "202" directory on Windows. Secondly, that's not a valid filename
according to the rules of #include, which (as we've already agreed) expects
all #include directives to consist of UTF-8-encoded text.

> You might be right that the current patch does not help the compiler find
> the included file

Well, then it shouldn't be pushed. Only patches that help should be pushed. :P

> The equivalent UTF-8 encoded file name like the following might help the
> compiler find the file:
>     #include "\343\203\231\343\203\274\343\202\267\343\203\203\343\202\257.c"

If I were the programmer, I would simply write

>     #include "こんにちは.c"

This should work fine on all filesystems whose native character sets encode
those particular glyphs. UTF-8, UTF-16, Shift-JIS, EUC... all should work fine.
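
As a rough illustration (a Python sketch, not part of the patch and not how
Clang does it; the helper names hex_bytes and transcode are made up for this
example), the two byte sequences from the od dumps above can be reproduced by
encoding the same #include line both ways, and a filename can be moved between
encodings by decoding and re-encoding:

```python
# Illustrative sketch only: same text, two encodings, two byte sequences.
include_line = '#include "こんにちは.c"\n'


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex, similar to `od -t x1`."""
    return " ".join(f"{b:02x}" for b in data)


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from encoding `src` and re-encode them in encoding `dst`."""
    return data.decode(src).encode(dst)


sjis = include_line.encode("shift_jis")  # what a Shift-JIS editor would save
utf8 = include_line.encode("utf-8")      # what Clang expects in the source

print(hex_bytes(sjis))
print(hex_bytes(utf8))

# Converting the UTF-8 bytes for a (hypothetical) Shift-JIS filesystem
# yields exactly the Shift-JIS byte sequence.
print(transcode(utf8, "utf-8", "shift_jis") == sjis)
```

The printed hex should match the byte columns of the two dumps (without the
offset column), assuming the script itself is saved as UTF-8.
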
Translating between UTF-8 and local filesystem encodings is a solved problem.

–Arthur
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
