On Friday, 29 December 2017 at 18:13:04 UTC, H. S. Teoh wrote:
On Fri, Dec 29, 2017 at 10:35:53AM +0000, Andrei via Digitalmars-d-learn wrote:
This may be tolerable if you are writing an application where Russian is just one of several rarely used options, but what if your whole environment is Russian?

You mean if your environment uses a non-UTF encoding? If your environment uses UTF, there is no problem. I have code with strings in Russian (and other languages) embedded, and it's no problem because everything is in Unicode, all input and all output.

No, I mean the difficulty of writing a program for a non-ASCII locale. Ever since C, learning any programming language starts with a "hello world" program, which every non-English programmer naturally tries to translate into their native language - and gets an unreadable mess on the screen. Thousands try, hundreds look for a solution, dozens find it, and a few continue with the new language. That's not because these programmers cannot read English textbooks; they can. It's because they want to write non-English programs for non-English people, and that's essential. And there are many programming languages (or rather their runtimes) which do not suffer from this deficiency.

That's the reason for Unicode adoption all over the programming world, the D language included. But what good is it to me if I can write a UTF-8 string with text in my native language in a D program, and still get the same unreadable mess on the screen?

Yes, a language still in development can lack support for some features, but this forum thread shows that a simple and handy solution exists - yet nobody cares to bring it to the first pages of every textbook for beginners, at least as a footnote. Thus thousands of potential fans of the new language are lost from the start.

But I understand that in Windows you may not have this luxury. So you have to deal with codepages and what-not.

Converting back and forth is not a big problem, and it actually also solves the problem of string comparisons, because std.uni provides utilities for comparing strings, etc. But it only works for Unicode, so you have to convert to Unicode internally anyway. Also, for static strings, it's not hard to make the codepage mapping functions CTFE-able, so you can actually write string literals in a codepage and have the compiler automatically convert them to UTF-8.
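
To make the conversion idea concrete, here is a minimal sketch of a CTFE-able codepage mapping, assuming the codepage is CP866 (the one mentioned later in this thread). The names cp866ToDchar and fromCP866 are made up for illustration and are not Phobos functions. Only ASCII and the Cyrillic letters are mapped; box-drawing and other symbols are replaced with U+FFFD, so a real converter would need the full table.

```d
/// Map one CP866 byte to a Unicode code point.
/// Simplification (assumption of this sketch): anything that is not ASCII
/// or a Cyrillic letter becomes U+FFFD, the replacement character.
dchar cp866ToDchar(ubyte b) pure nothrow @nogc @safe
{
    if (b < 0x80)
        return cast(dchar) b;                          // ASCII is unchanged
    if (b <= 0xAF)
        return cast(dchar)(0x0410 + (b - 0x80));       // 0x80..0xAF -> U+0410..U+043F
    if (b >= 0xE0 && b <= 0xEF)
        return cast(dchar)(0x0440 + (b - 0xE0));       // 0xE0..0xEF -> U+0440..U+044F
    if (b == 0xF0)
        return cast(dchar) 0x0401;                     // capital IO
    if (b == 0xF1)
        return cast(dchar) 0x0451;                     // small io
    return cast(dchar) 0xFFFD;                         // not mapped in this sketch
}

/// Convert CP866 bytes to a UTF-8 string.
/// Appending a dchar to a string encodes it as UTF-8.
string fromCP866(const(ubyte)[] bytes) pure @safe
{
    string result;
    foreach (b; bytes)
        result ~= cp866ToDchar(b);
    return result;
}

// Evaluated by the compiler: the conversion costs nothing at run time.
enum greeting = fromCP866([0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2]);
static assert(greeting == "Привет");
```

The same function also works at run time on bytes read from the console, so one table serves both literals and user input.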

The other approach, if you don't like the idea of converting codepages all the time, is to explicitly work in ubyte[] for all strings. Or, preferably, create your own string type with ubyte[] representation underneath, and implement your own comparison functions, etc., then use this type for all strings. Better yet, contribute this to code.dlang.org so that others who have the same problem can reuse your code instead of needing to write their own.
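
A rough sketch of what such a type could look like, again assuming CP866 as the codepage. CP866String, fold and icmp are hypothetical names, and the case folding covers only ASCII and the Cyrillic letters. It is not a proper alphabetical collation (for example, the letter IO sorts after all other letters here), just an illustration of keeping the bytes as they are and comparing them yourself.

```d
/// A string kept in its original CP866 bytes (hypothetical type for illustration).
struct CP866String
{
    immutable(ubyte)[] bytes;

    /// Fold a CP866 byte to its upper-case counterpart.
    private static ubyte fold(ubyte b) pure nothrow @nogc @safe
    {
        if (b >= 'a' && b <= 'z')
            return cast(ubyte)(b - 0x20);   // ASCII a..z -> A..Z
        if (b >= 0xA0 && b <= 0xAF)
            return cast(ubyte)(b - 0x20);   // first 16 lower-case Cyrillic letters
        if (b >= 0xE0 && b <= 0xEF)
            return cast(ubyte)(b - 0x50);   // remaining 16 lower-case Cyrillic letters
        if (b == 0xF1)
            return 0xF0;                    // small io -> capital IO
        return b;
    }

    /// Case-insensitive comparison: negative, zero or positive like strcmp.
    int icmp(const CP866String rhs) const pure nothrow @nogc @safe
    {
        import std.algorithm.comparison : min;

        foreach (i; 0 .. min(bytes.length, rhs.bytes.length))
        {
            immutable a = fold(bytes[i]);
            immutable b = fold(rhs.bytes[i]);
            if (a != b)
                return a < b ? -1 : 1;
        }
        if (bytes.length == rhs.bytes.length)
            return 0;
        return bytes.length < rhs.bytes.length ? -1 : 1;
    }
}
```

The default == on the struct still compares the raw bytes exactly, so exact and case-insensitive comparison stay separate operations.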

I'd definitely try this if I decide to use the D language for my purposes (which is not settled yet). But to decide I need some experience, and for now it has stalled at reading the user's input (as an exercise I intend to translate my recent, rather complex interactive C# program into D).

Still, this does not solve the localized input problem: any localized input throws the exception “std.utf.UTFException... Invalid UTF-8 sequence”.

Is the exception thrown in readln() or in writeln()? If it's in writeln(), it shouldn't be a big deal, you just have to pass the data returned by readln() to fromKOI8 (or whatever other codepage you're using).

If the problem is in readln(), then you probably need to read the input in binary (i.e., as ubyte[]) and convert it manually. Unfortunately, there's no other way around this if you're forced to use codepages. The ideal situation is if you can just use Unicode throughout your environment. But of course, sometimes you have no choice.
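
One way to read a line in binary could look like the sketch below. readRawLine is a made-up helper, not part of Phobos. It pulls one byte at a time with File.rawRead so that interactive input returns as soon as a line is available, and it performs no UTF-8 validation at all. Converting the returned bytes is left to whatever codepage function you use.

```d
import std.stdio : File, stdin;

/// Read one line from `f` as raw bytes, with no UTF-8 validation.
/// The trailing "\n" or "\r\n" is dropped. Hypothetical helper for illustration.
ubyte[] readRawLine(File f)
{
    ubyte[] line;
    ubyte[1] buf;

    // rawRead returns the part of the buffer it filled; an empty slice means EOF.
    while (f.rawRead(buf[]).length == 1)
    {
        if (buf[0] == '\n')
            break;
        line ~= buf[0];
    }

    // rawRead works in binary mode, so a Windows line ending leaves a '\r' behind.
    if (line.length && line[$ - 1] == '\r')
        line = line[0 .. $ - 1];
    return line;
}
```

Usage would be something like `auto bytes = readRawLine(stdin);` followed by the codepage conversion. Reading byte by byte is slow for large files, but for console input it hardly matters.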

It depends.

If I skip proper console code page initialization, I can see in the debugger that the runtime reads the user's input as CP866 (MS-DOS) Cyrillic and then throws the exception "Invalid UTF-8 sequence" when it tries to handle it as a UTF-8 string (in particular in the strip() or writeln() functions). This situation seems quite manageable with the code page conversions you've mentioned above. I tried the first library function I found (in std.windows.charset) and got a rather fanciful but working statement:

response = fromMBSz((readln()~"\0").ptr, 1).strip();

which assigns correct Latin/Cyrillic contents to the response variable.

And if I initialize the console with a SetConsoleCP(65001) call, things get worse, as I said above. Then readln() returns an empty string and something breaks inside the runtime, because all further readln() calls do not wait for user input and return empty strings immediately.



