On Wednesday, 5 August 2020 at 17:39:36 UTC, Mike Surette wrote:
> In my efforts to learn D I am writing some code to read files in different UTF encodings with the aim of having them end up as UTF-8 internally. As a start I have the following code:
>
> import std.stdio;
> import std.file;
>
> void main(string[] args)
> {
>     if (args.length == 2)
>     {
>         if (args[1].exists && args[1].isFile)
>         {
>             auto f = File(args[1]);
>             writeln(args[1]);
>             for (auto i = 1; i <= 3; ++i)
>                 write(f.readln);
>         }
>     }
> }
>
> It works well, outputting the file name and the first three lines of the file properly, without any regard to the encoding of the file. The exception to this is if the file is UTF-16: with both LE and BE encodings, two characters representing the BOM are printed.
>
> I assume that write detects the encoding of the string returned by readln and prints it correctly, rather than readln reading it in as a consistent encoding. Is this correct?
>
> Is there a way to remove the BOM from the input buffer and still know the encoding of the file?
>
> Is there a D idiomatic way to do what I want to do?
>
> Mike
All strings in D are _assumed_ to be UTF-8, so your I/O reading function needs to check that the input actually is UTF-8. File/File.readln does not do that, so you are actually getting raw UTF-16 bytes in your string, not UTF-8 bytes.

What you are seeing through writeln is not fully correct: if you only test with English characters, each character byte is paired with a null byte (0), and those null bytes simply aren't rendered in the console.
You can verify this with this simple code:

import std.stdio, std.string;

auto s = File("test.txt").readln;
writefln("%(%02x %)", s.representation);

result:

ff fe 68 00 65 00 6c 00 6c 00 6f 00 20 00 77 00 6f 00 72 00 6c 00 64 00 21 00 0d 00 0a
You can see the UTF-16 LE BOM (ff fe) at the start, followed by the English characters, each encoded as its ASCII byte paired with a null byte.
Basically what you want to do is write a function that converts the input encoding to UTF-8, so you can use the data in D strings. If you want to get into this yourself, std.encoding offers most of the functionality:

https://dlang.org/phobos/std_encoding.html

You can use the getBOM function to try to determine the encoding by its BOM, and then convert to UTF-8 from the source encoding using the `transcode` function or manually using the encoding classes.
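
For example, here is a minimal sketch of that getBOM + transcode route, assuming the input file is UTF-16 LE and you are on a little-endian machine (readUTF16LEAsUTF8 is a made-up helper name; the other BOM.* cases would follow the same pattern):

import std.encoding : getBOM, BOM, transcode;
import std.file : read;

// hypothetical helper: read a UTF-16 LE file (on a little-endian CPU,
// so no byte swapping is needed) and transcode it to UTF-8
string readUTF16LEAsUTF8(string path)
{
    // we own this freshly read buffer, so the immutable cast is fine
    auto bytes = cast(immutable(ubyte)[]) read(path);
    auto bom = getBOM(bytes);
    assert(bom.schema == BOM.utf16le, "expected a UTF-16 LE BOM");
    // drop the BOM, then reinterpret the payload as wchar code units
    auto units = cast(immutable(wchar)[]) bytes[bom.sequence.length .. $];
    string result;
    transcode(units, result); // UTF-16 -> UTF-8
    return result;
}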
If you don't have a BOM, you need some kind of algorithm to determine the encoding of your file. If you don't want to do that and just want to check whether it's UTF-8, use the `std.utf : validate` function, which throws if your string is not valid UTF-8 and otherwise does nothing.
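
A tiny sketch of that check ("test.txt" is just a placeholder file name):

import std.file : read;
import std.utf : validate;

auto text = cast(string) read("test.txt"); // placeholder file name
validate(text); // throws a UTFException on invalid UTF-8, otherwise a no-op

Note that std.file.readText essentially does this read + validate combination for you.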
If you only support UTF-8, UTF-16 and possibly UTF-32, you can just use the std.utf functions after determining the BOM, to lazily convert without allocating memory (useful if you only go over your string linearly without going back):
import std;

void main() {
    // readln is actually rather unsafe for this! You should use std.file.read
    // or File.rawRead or read byte chunks instead. For chunking you need to
    // adjust the encode API however and probably make a helper struct with a
    // small buffer.
    string s = File("test.txt").readln;
    // need to remove the line terminator before encoding
    // (it's encoded in UTF-8, potentially after UTF-16)
    if (s.length) s = s[0 .. $ - 1];
    string e = encodeUTF8(s.representation);
    writefln("%s\n(%(%02x %))", s, s.representation);
    writefln("%s\n(%(%02x %))", e, e.representation);
}

string encodeUTF8(immutable(ubyte)[] bytes) {
    auto bom = getBOM(bytes);
    bytes = bytes[bom.sequence.length .. $];
    switch (bom.schema) {
    // optionally we could validate, but we just trust the UTF-8 BOM here
    case BOM.utf8: return cast(string) bytes;
    case BOM.utf16le: return convertUTF!wchar(bytes, true);
    case BOM.utf16be: return convertUTF!wchar(bytes, false);
    case BOM.utf32le: return convertUTF!dchar(bytes, true);
    case BOM.utf32be: return convertUTF!dchar(bytes, false);
    default:
        // no (recognized) BOM: assume UTF-8, but verify it
        string input = cast(string) bytes;
        validate(input);
        return input;
    }
}

private string convertUTF(T)(scope const(ubyte)[] bytes, bool littleEndian) {
    // T.sizeof is expected to be 2 or 4 (T.sizeof / 4 maps that to 0 or 1)
    enum name = ["UTF-16", "UTF-32"][T.sizeof / 4];
    alias Int = AliasSeq!(ushort, uint)[T.sizeof / 4];
    if (bytes.length % T.sizeof != 0)
        throw new Exception("File is " ~ name ~ ", but got "
            ~ bytes.length.to!string ~ " bytes, which is not a multiple of "
            ~ T.sizeof.to!string ~ "!");
    scope Int[] units = (cast(Int*) bytes.ptr)[0 .. bytes.length / T.sizeof];
    // swap mismatching endianness
    version (LittleEndian) // CPU is little endian, swap if file is big endian
        bool swap = !littleEndian;
    else // CPU is big endian, swap if file is little endian
        bool swap = littleEndian;
    if (swap) swapAllEndian(units);
    scope wstr = cast(const(T)[]) units;
    auto ret = wstr.toUTF8;
    // because we are operating in-place, we need to swap back to keep the
    // memory consistent; if you don't use the byte data anywhere else, you
    // could omit this (note this could be unsafe though)
    if (swap) swapAllEndian(units);
    return ret;
}

private void swapAllEndian(T)(T[] data) {
    // TODO: could probably optimize this with SIMD instructions
    foreach (i; 0 .. data.length)
        data[i] = swapEndian(data[i]);
}
Example library if you want to guess encoding without BOM:
https://code.dlang.org/packages/libguess-d
Used API docs:

https://dlang.org/phobos/std_bitmanip.html#swapEndian <- swapping BE/LE for native encoding
https://dlang.org/phobos/std_encoding.html <- BOM detection, transcoding capabilities for encodings other than UTF
https://dlang.org/phobos/std_utf.html <- low level UTF-8 encoding/decoding, lazy decoding, validation