I hate to ping the list again, but having had no response in a week, I wonder whether I committed a faux pas by mailing the wrong list or sending an overly large patch.
I'd like to get something working, even if it's a smaller patch.

-Scott

On Sat, Jul 16, 2011 at 8:23 PM, Scott Conger <[email protected]> wrote:
> And I attached the wrong llvm diff. Here is the correct one.
>
> On Sat, Jul 16, 2011 at 8:21 PM, Scott Conger <[email protected]> wrote:
>> Attached patch adds support for -finput-charset and automatic text
>> conversion when there are multibyte characters or a byte-order-mark is
>> present. The net effect is that all internal text should now be in
>> UTF-8.
>>
>> I have the exec charset options mostly working, but I trimmed it down
>> to this for now, as it's a decently sized patch as-is.
>>
>>
>> Performance impact:
>>
>> At a minimum, we have to scan through the input text to see if there
>> are any multi-byte characters. There are usually none as portable code
>> won't have any. The cost of this is lower if you have SSE2 support as
>> I added an optimized version using intrinsics:
>>
>> For 1000 calls against a 16 MB ASCII buffer, on an AMD Athlon 7850
>> (2.81 GHz) rough costs with GCC were:
>> Default checkAscii - 13050 ms
>> SSE2 checkAscii - 4025 ms
>>
>> If you do use -finput-charset, there is multi-byte text, or some
>> byte-order-mark is present, the cost to convert the text to UTF-8 is
>> somewhere between 10 and 20 times higher than the default checkAscii
>> implementation. It varies considerably depending on the input and
>> character set.
>>
>> As a special case, UTF-8 input avoids most of this cost and it just
>> checks that it's valid UTF-8.
>>
>> GCC differences:
>>
>> * Didn't add GCC's support for IBM character encodings, although
>> -finput-charset should work if iconv supports it.
>> * Didn't add GCC's special handling of a few character sets like
>> Shift-JIS when no iconv is present.
>> * GCC only seems to do byte-order-mark detection if the underlying
>> iconv does, which apparently varies.
>>
>> Issues:
>>
>> * It turned out to be quite ugly to get iconv working on Windows. See
>> comment in NativeIconv.cpp. If what's there is objectionable, I'd
>> prefer to rip out Windows support of iconv for now.
>> * Difficult to automatically test as iconv implementations support
>> very different sets of encodings.
>> * It looks like I picked up some non-checked-in changes when I
>> regenerated configure relating to a bug report URL?
>>
>> Testing:
>>
>> Did Linux GCC, Windows Visual Studio 10 and Cygwin GCC builds. Ran all
>> tests on Linux.
>>
>> You can run a simple input conversion test like so:
>>
>> sconger@scott-ubuntu:~/dev/llvmpatch/build$ iconv -f ASCII -t UTF-16BE test.c > test_utf16be.c
>> sconger@scott-ubuntu:~/dev/llvmpatch/build$ ./bin/clang -finput-charset=UTF-16BE test_utf16be.c
>> sconger@scott-ubuntu:~/dev/llvmpatch/build$ ./a.out
>> Hello World
>>
>> -Scott
>>
>
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
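
For anyone following the performance discussion without the patch open: the ASCII pre-scan described in the quoted message could look roughly like the sketch below. This is not the checkAscii from the patch. The name isAsciiBuffer and the portable eight-bytes-at-a-time approach are assumptions made for illustration only; the patch's SSE2 variant uses intrinsics instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the patch's checkAscii): returns true when every
// byte in [data, data + size) has its high bit clear, i.e. is 7-bit ASCII.
bool isAsciiBuffer(const char *data, std::size_t size) {
  const std::uint64_t highBits = 0x8080808080808080ULL;
  std::size_t i = 0;

  // Test eight bytes per iteration. memcpy sidesteps alignment and
  // strict-aliasing problems; compilers typically turn it into one load.
  for (; i + sizeof(std::uint64_t) <= size; i += sizeof(std::uint64_t)) {
    std::uint64_t word;
    std::memcpy(&word, data + i, sizeof(word));
    if (word & highBits)
      return false; // at least one byte in this word is >= 0x80
  }

  // Check the remaining tail (fewer than eight bytes) one byte at a time.
  for (; i < size; ++i)
    if (static_cast<unsigned char>(data[i]) & 0x80)
      return false;

  return true;
}
```

A buffer that passes this check needs no conversion, since plain ASCII is already valid UTF-8; only a buffer that fails it would go on to the more expensive validation or iconv path.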
