Re: Compare string with German umlauts

WebFreak001 via Digitalmars-d-learn Mon, 18 May 2020 07:27:14 -0700

On Monday, 18 May 2020 at 13:44:15 UTC, Martin Tschierschke wrote:

> Hi,
> I have to find a certain line in a file, with a text containing umlauts.
> How do you do this?
>
> The following was not working:
>
> foreach(i,line; file){
>  if(line=="My text with ö oe, ä ae or ü"){
>    writeln("found it at line",i);
>  }
> }
> I ended up using line.canFind("with part of the text without umlaut").
> It solved the problem, but what is the right way to use umlauts (encode them) inside the program?

Your code should have already worked like that, assuming your input file is a UTF-8 file. Check with an editor like Notepad++ or Visual Studio Code what the actual encoding of your text file is. In D all strings you specify in source are UTF-8 bytes in the end, and a byte-by-byte comparison like your `line == "..."` will fail if line is not UTF-8.

My guess is that your file is most likely encoded in Windows-1251 or Windows-1252. To quickly check if it is UTF-8, print out your strings with separators between each character, for example using `writefln("%(%s, %)", line.byUTF!dchar);` (byUTF lives in std.utf), and see if it is actually 'M', 'y', ' ', 't', 'e', 'x', 't', ' ', 'w', 'i', 't', 'h', ' ', 'ö', ' ', 'o', 'e', ',', ' ', 'ä', ' ', 'a', 'e', ' ', 'o', 'r', ' ', 'ü'
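As a complementary check, here is a minimal sketch that validates a line as UTF-8 and dumps its raw bytes. It relies on std.utf.validate throwing a UTFException on malformed input; the helper name isValidUtf8 and the sample bytes are made up for illustration and are not taken from your file.

```d
import std.stdio : writefln, writeln;
import std.utf : UTFException, validate;

/// Hypothetical helper: true if `data` is well-formed UTF-8.
bool isValidUtf8(const(char)[] data)
{
    try
    {
        validate(data);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "ö" written as an escape, so this does not depend on how the
    // source file itself is saved. In UTF-8 it is the two bytes C3 B6.
    string utf8 = "\u00F6";

    // In Windows-1252 the same letter is the single byte F6,
    // which is not valid UTF-8 on its own.
    immutable(ubyte)[] rawBytes = [0xF6];
    string win1252 = cast(string) rawBytes;

    writefln("%(%02X %)", cast(const(ubyte)[]) utf8);    // prints C3 B6
    writefln("%(%02X %)", cast(const(ubyte)[]) win1252); // prints F6
    writeln(isValidUtf8(utf8));    // prints true
    writeln(isValidUtf8(win1252)); // prints false
}
```

If a line from your file fails this check, or the byte dump shows a single byte where you expect an umlaut, the file is not UTF-8.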

If you have identified that the character encoding is indeed your problem, interpret your line with the correct character encoding using


    import std.encoding;
    Windows1252String win1252Line = cast(Windows1252String)line;

and then convert that to utf8:

    string utf8Line;
    transcode(win1252Line, utf8Line);

and then compare that with your input string:

    if (utf8Line == "My text with ö oe, ä ae or ü") { ... }
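Put together, the steps above could look roughly like the following sketch. It assumes the whole file is Windows-1252; the function name findLine, the sample file name and its contents are hypothetical and only there to make the example runnable.

```d
import std.encoding : Windows1252String, transcode;
import std.file : remove;
import std.stdio : File, writeln;

/// Hypothetical helper: returns the 1-based number of the first line
/// equal to `needle`, or 0 if no line matches.
/// Assumes the file at `path` is encoded in Windows-1252.
size_t findLine(string path, string needle)
{
    size_t lineNo = 0;
    foreach (rawLine; File(path).byLine)
    {
        ++lineNo;
        // byLine reuses its buffer, so copy before casting to immutable.
        auto win1252Line = cast(Windows1252String) rawLine.idup;
        string utf8Line;
        transcode(win1252Line, utf8Line);
        if (utf8Line == needle)
            return lineNo;
    }
    return 0;
}

void main()
{
    // Sample file: "first", then "ö ä ü" as Windows-1252 bytes.
    enum path = "umlaut_sample.txt";
    immutable(ubyte)[] umlautLine = [0xF6, 0x20, 0xE4, 0x20, 0xFC, 0x0A];
    {
        auto f = File(path, "wb");
        f.rawWrite("first\n");
        f.rawWrite(umlautLine);
        f.close();
    }
    scope (exit) remove(path);

    // The needle is an ordinary UTF-8 string ("ö ä ü" as escapes).
    writeln(findLine(path, "\u00F6 \u00E4 \u00FC")); // prints 2
}
```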

Alternatively you can also change your comparison string to be in Windows 1251/1252 encoding if you know that all your files will have this encoding, but I would advise against that and instead somehow figure out the encoding based on common German characters or an external library/program and always convert to UTF-8 for all text operations.
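One crude way to "always convert to UTF-8" is sketched below: keep the input if it already validates as UTF-8, otherwise treat it as Windows-1252. This is a fallback under an assumption, not real encoding detection (it cannot tell Windows-1252 apart from other single-byte encodings), and the function name toUtf8 is made up for this example.

```d
import std.encoding : Windows1252String, transcode;
import std.utf : UTFException, validate;

/// Hypothetical helper: returns `raw` as UTF-8.
/// Input that validates as UTF-8 is returned unchanged; anything else
/// is ASSUMED to be Windows-1252 and transcoded.
string toUtf8(const(char)[] raw)
{
    try
    {
        validate(raw);
        return raw.idup;
    }
    catch (UTFException)
    {
        string result;
        transcode(cast(Windows1252String) raw.idup, result);
        return result;
    }
}

void main()
{
    immutable(ubyte)[] legacy = [0xFC];              // "ü" in Windows-1252
    assert(toUtf8("\u00FC") == "\u00FC");            // already UTF-8, kept as is
    assert(toUtf8(cast(string) legacy) == "\u00FC"); // converted to UTF-8
}
```

Pure ASCII text passes through untouched either way, since it is identical in both encodings.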

Another tip: if you perform case-insensitive comparison with UTF-8, use std.uni : sicmp or icmp (sicmp is faster / less advanced) and use it like `sicmp(strA, strB) == 0`, where you replace == with < or > if you want to sort. Note that this is not bound to any locale and rather behaves like the invariant locale. You will probably want to use OS APIs or third-party libraries to do locale-based text operations (like text in UI).
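A small sketch of the difference between the two, with sample words picked only for illustration: sicmp uses simple (one-to-one) case folding, while icmp uses full case folding and therefore also matches "ß" against "ss".

```d
import std.algorithm.sorting : sort;
import std.stdio : writeln;
import std.uni : icmp, sicmp;

void main()
{
    // "ÜBER" vs "über", written with escapes to stay encoding-independent.
    assert(sicmp("\u00DCBER", "\u00FCber") == 0);
    assert(icmp("\u00DCBER", "\u00FCber") == 0);

    // "Straße" vs "STRASSE": only icmp treats these as equal.
    assert(icmp("Stra\u00DFe", "STRASSE") == 0);
    assert(sicmp("Stra\u00DFe", "STRASSE") != 0);

    // Case-insensitive, locale-independent sort using < instead of ==.
    auto words = ["banana", "Cherry", "apple"];
    sort!((a, b) => sicmp(a, b) < 0)(words);
    writeln(words); // prints ["apple", "banana", "Cherry"]
}
```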
