On Wednesday, 5 August 2020 at 17:39:36 UTC, Mike Surette wrote:
> In my efforts to learn D I am writing some code to read files in different UTF encodings with the aim of having them end up as UTF-8 internally. As a start I have the following code:
>
> import std.stdio;
> import std.file;
>
> void main(string[] args)
> {
>     if (args.length == 2)
>     {
>         if (args[1].exists && args[1].isFile)
>         {
>             auto f = File(args[1]);
>             writeln(args[1]);
>             for (auto i = 1; i <= 3; ++i)
>                 write(f.readln);
>         }
>     }
> }
>
> It works well, outputting the file name and the first three lines of the file properly, without any regard to the encoding of the file. The exception to this is if the file is UTF-16: with both LE and BE encodings, two characters representing the BOM are printed.
>
> I assume that write detects the encoding of the string returned by readln and prints it correctly, rather than readln reading it in as a consistent encoding. Is this correct?
>
> Is there a way to remove the BOM from the input buffer and still know the encoding of the file?
>
> Is there a D idiomatic way to do what I want to do?
>
> Mike
All strings in D are _assumed_ to be UTF-8, so your I/O reading function needs to check that the input actually is UTF-8. File/File.readln does not do that, so you are actually getting raw UTF-16 bytes in your string, not UTF-8 bytes.

What you are seeing through writeln is not fully correct: if you only test with English characters, each character byte is paired with a null byte (0), and those null bytes simply aren't rendered in the console.
You can verify this with this simple code:

import std.stdio, std.string;

auto s = File("test.txt").readln;
writefln("%(%02x %)", s.representation);

result:

ff fe 68 00 65 00 6c 00 6c 00 6f 00 20 00 77 00 6f 00 72 00 6c 00 64 00 21 00 0d 00 0a
You can see the UTF-16 LE BOM (ff fe) at the start, followed by the English characters, each encoded as its ASCII byte paired with a null byte.
Basically what you want to do is write a function that converts the input encoding to UTF-8, so you can use the data in D strings. If you want to get into this yourself, std.encoding offers most of the functionality:

https://dlang.org/phobos/std_encoding.html

You can use the getBOM function to try to determine the encoding by its BOM, and then convert to UTF-8 from the source encoding using the `transcode` function or manually using the encoding classes.
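
For example, here is a minimal sketch of that getBOM + transcode route, assuming the input file is UTF-16 LE and you are on a little-endian machine (readUTF16LEAsUTF8 is a made-up helper name; the other BOM.* cases would follow the same pattern):

import std.encoding : getBOM, BOM, transcode;
import std.file : read;

// hypothetical helper: read a UTF-16 LE file (on a little-endian CPU,
// so no byte swapping is needed) and transcode it to UTF-8
string readUTF16LEAsUTF8(string path)
{
    // we own this freshly read buffer, so the immutable cast is fine
    auto bytes = cast(immutable(ubyte)[]) read(path);
    auto bom = getBOM(bytes);
    assert(bom.schema == BOM.utf16le, "expected a UTF-16 LE BOM");
    // drop the BOM, then reinterpret the payload as wchar code units
    auto units = cast(immutable(wchar)[]) bytes[bom.sequence.length .. $];
    string result;
    transcode(units, result); // UTF-16 -> UTF-8
    return result;
}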
If you don't have a BOM, you need some kind of algorithm to determine the encoding of your file. If you don't want to do that and just want to check whether it's UTF-8, use the `std.utf : validate` function, which throws if your string is not valid UTF-8 and otherwise does nothing.
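
A tiny sketch of that check ("test.txt" is just a placeholder file name):

import std.file : read;
import std.utf : validate;

auto text = cast(string) read("test.txt"); // placeholder file name
validate(text); // throws a UTFException on invalid UTF-8, otherwise a no-op

Note that std.file.readText essentially does this read + validate combination for you.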
If you only support UTF-8, UTF-16 and possibly UTF-32, you can just use the std.utf functions after determining the BOM, to lazily convert without allocating memory (useful if you only go over your string linearly without going back):
import std;

void main() {
    // readln is actually rather unsafe for this! You should use std.file.read
    // or File.rawRead or read byte chunks instead. For chunking you need to
    // adjust the encode API however and probably make a helper struct with a
    // small buffer.
    string s = File("test.txt").readln;
    // need to remove the line terminator before encoding
    // (it's encoded in UTF-8, potentially after UTF-16)
    if (s.length) s = s[0 .. $ - 1];
    string e = encodeUTF8(s.representation);
    writefln("%s\n(%(%02x %))", s, s.representation);
    writefln("%s\n(%(%02x %))", e, e.representation);
}

string encodeUTF8(immutable(ubyte)[] bytes) {
    auto bom = getBOM(bytes);
    bytes = bytes[bom.sequence.length .. $];
    switch (bom.schema) {
    // optionally we could validate, but we just trust the UTF-8 BOM here
    case BOM.utf8: return cast(string) bytes;
    case BOM.utf16le: return convertUTF!wchar(bytes, true);
    case BOM.utf16be: return convertUTF!wchar(bytes, false);
    case BOM.utf32le: return convertUTF!dchar(bytes, true);
    case BOM.utf32be: return convertUTF!dchar(bytes, false);
    default:
        // no (recognized) BOM: assume UTF-8, but verify it
        string input = cast(string) bytes;
        validate(input);
        return input;
    }
}

private string convertUTF(T)(scope const(ubyte)[] bytes, bool littleEndian) {
    // T.sizeof is expected to be 2 or 4 (T.sizeof / 4 maps that to 0 or 1)
    enum name = ["UTF-16", "UTF-32"][T.sizeof / 4];
    alias Int = AliasSeq!(ushort, uint)[T.sizeof / 4];
    if (bytes.length % T.sizeof != 0)
        throw new Exception("File is " ~ name ~ ", but got "
            ~ bytes.length.to!string ~ " bytes, which is not a multiple of "
            ~ T.sizeof.to!string ~ "!");
    scope Int[] units = (cast(Int*) bytes.ptr)[0 .. bytes.length / T.sizeof];
    // swap mismatching endianness
    version (LittleEndian) // CPU is little endian, swap if file is big endian
        bool swap = !littleEndian;
    else // CPU is big endian, swap if file is little endian
        bool swap = littleEndian;
    if (swap) swapAllEndian(units);
    scope wstr = cast(const(T)[]) units;
    auto ret = wstr.toUTF8;
    // because we are operating in-place, we need to swap back to keep the
    // memory consistent; if you don't use the byte data anywhere else, you
    // could omit this (note this could be unsafe though)
    if (swap) swapAllEndian(units);
    return ret;
}

private void swapAllEndian(T)(T[] data) {
    // TODO: could probably optimize this with SIMD instructions
    foreach (i; 0 .. data.length)
        data[i] = swapEndian(data[i]);
}
Example library if you want to guess encoding without BOM:
https://code.dlang.org/packages/libguess-d
Used API docs:

https://dlang.org/phobos/std_bitmanip.html#swapEndian <- swapping BE/LE for native encoding
https://dlang.org/phobos/std_encoding.html <- BOM detection, transcoding capabilities for encodings other than UTF
https://dlang.org/phobos/std_utf.html <- low level UTF-8 encoding/decoding, lazy decoding, validation