On 09/03/2017 01:39 AM, Ali Çehreli wrote:
> Ok, I see that I made a mistake, but I still don't think the conversion is one-way. If we can convert byte-by-byte, we should be able to convert back byte-by-byte, right?

You weren't converting byte-by-byte. You were only converting the significant bytes of the code points, throwing away leading zeroes.
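That is, if the conversion was something like format!"%x" per decoded code point (my guess, since that code wasn't quoted), the widths vary and the output can't be split back:

----
import std.format;

void main() {
    // Without zero-padding, a small code point yields fewer digits
    // than a large one, so "4141" could be 'A','A' or U+4141.
    assert(format!"%x"(cast(uint) 'A') == "41");
    assert(format!"%x"(cast(uint) '…') == "2026");
}
----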

> What I failed to ensure was to iterate by code units.

A UTF-8 code unit is a byte, so "%02x" is enough, yes. But for UTF-16 and UTF-32 code units, it's not. You need to match the format width to the size of the code unit.
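Concretely, two hex digits per byte of the code unit:

----
import std.format;

void main() {
    wchar w = '…';   // one UTF-16 code unit, 0x2026
    dchar d = '…';   // one UTF-32 code unit, 0x00002026
    assert(format!"%04x"(w) == "2026");
    assert(format!"%08x"(d) == "00002026");
}
----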

Or maybe just convert everything to UTF-8 first. That also sidesteps any endianness issues.
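For example, with std.utf.byUTF, which transcodes any string type to a range of UTF-8 code units:

----
import std.algorithm : equal;
import std.utf : byCodeUnit, byUTF;

void main() {
    // Transcode to UTF-8 code units first: single bytes have no
    // byte order, and "%02x" is always the right width for them.
    assert("…"w.byUTF!char.equal("…".byCodeUnit)); // 0xe2, 0x80, 0xa6
}
----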

The following gets the same string back:

import std.stdio;
import std.string;
import std.algorithm;
import std.range;
import std.utf;
import std.conv;

auto toHex(R)(R input) {
    // Hex-encode each code unit as (at least) two digits.
    // As Moritz Maxeiner says, formatting each code unit separately
    // like this is expensive.
    return input.byCodeUnit.map!(c => format!"%02x"(c)).joiner;
}

// Map a lowercase hex digit to its numeric value, 0 through 15.
int hexValue(C)(C c) {
    switch (c) {
    case '0': .. case '9':
        return c - '0';
    case 'a': .. case 'f':
        return c - 'a' + 10;
    default:
        assert(false);
    }
}

auto fromHex(R, Dst = char)(R input) {
    // Take the digits two at a time and rebuild one code unit from each pair.
    return input.chunks(2).map!((ch) {
            auto high = ch.front.hexValue * 16;
            ch.popFront();
            return high + ch.front.hexValue;
        }).map!(value => cast(Dst) value);
}

void main() {
    assert("AAA".toHex.fromHex.equal("AAA"));

    assert("ö…".toHex.fromHex.equal("ö…".byCodeUnit));
    // Alternative check:
    assert("ö…".toHex.fromHex.text.equal("ö…"));
}

It still fails with UTF-16 and UTF-32 strings, though: "%02x" is a minimum width, so a code unit like 0x2026 comes out as four digits, and fromHex then splits it into two separate bytes:

----
writeln("…"w.toHex.fromHex.text); /* prints " &" */
writeln("…"d.toHex.fromHex.text); /* prints " &" */
----
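If you do want the wide cases to round-trip, one way (just a sketch; toHexFixed and fromHexFixed are illustrative names, not Phobos) is to derive the digit count from the code unit's size on both sides:

----
import std.algorithm;
import std.conv;
import std.format;
import std.range;
import std.utf;

// Like toHex above, but the digit count is derived from the code
// unit's size, so every code unit occupies a fixed width.
auto toHexFixed(R)(R input) {
    alias CU = ElementEncodingType!R;                  // char, wchar, or dchar
    enum fmt = "%0" ~ to!string(2 * CU.sizeof) ~ "x";  // "%02x"/"%04x"/"%08x"
    return input.byCodeUnit.map!(c => format!fmt(c)).joiner;
}

// The inverse: chunk at the same fixed width and parse each chunk
// back into a single code unit.
auto fromHexFixed(Dst, R)(R input) {
    return input.chunks(2 * Dst.sizeof)
                .map!(ch => cast(Dst) ch.text.to!uint(16));
}

void main() {
    assert("…"w.toHexFixed.fromHexFixed!wchar.equal("…"w.byCodeUnit));
    assert("…"d.toHexFixed.fromHexFixed!dchar.equal("…"d.byCodeUnit));
}
----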
