$ iconv -f UTF-16LE -t UTF-8 < CUSTOM-utf16le-2016.dic >
iconv: illegal input sequence at position 34076
Apparently that file has bytes `\xca \xde` near the end, and that doesn't seem
to be a sequence accepted by iconv (so I'd guess it really is invalid).
Reading up on how UTF-16 works, it does indeed seem invalid: the input contains
the sequence `0xD7B8 0xDECA`. `0xD7B8` is [HANGUL JUNGSEONG
YU-O](http://www.charbase.com/d7b8-unicode-hangul-jungseong-yu-o), which is
encoded as a single word (it is no higher than `U+FFFF`, and not in the range
`U+D800 - U+DFFF`). The next word is `0xDECA`, which, being in the range
`0xDC00 - 0xDFFF`, should be the second word of a two-word pair. It is not (the
previous word is not in the `0xD800 - 0xDBFF` range), so it is invalid.
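
To make that check concrete, here is a minimal Python sketch that scans UTF-16LE data for surrogates that are not part of a valid pair. The function name `find_unpaired_surrogates` and the four-byte sample are made up for illustration (the sample only reproduces the two words quoted above, not the real file), and a trailing odd byte is simply ignored.

```python
import struct

def find_unpaired_surrogates(data: bytes):
    """Return (byte_offset, code_unit) for each unpaired surrogate in UTF-16LE data."""
    n_units = len(data) // 2  # assumption: a trailing odd byte is ignored
    units = struct.unpack("<%dH" % n_units, data[: n_units * 2])
    problems = []
    i = 0
    while i < n_units:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: only valid if a low surrogate follows it.
            if i + 1 < n_units and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            problems.append((i * 2, unit))
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate that was not consumed as the second half of a pair.
            problems.append((i * 2, unit))
        i += 1
    return problems

# Hypothetical sample: the words 0xD7B8 0xDECA, little-endian.
sample = b"\xb8\xd7\xca\xde"
for offset, unit in find_unpaired_surrogates(sample):
    print("unpaired surrogate 0x%04X at byte offset %d" % (unit, offset))
```

Run against the whole file (read in binary mode), this should report an offset close to the position iconv complains about, assuming there are no other problems earlier in the file.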
I tested other editors, like GEdit and `vim`, and both have the same problem:
they fail to open the file properly. GEdit opens it almost correctly, but
only up to the invalid sequence, warning that the file might be truncated and
that saving it could result in data loss. Vim shows plain garbage.
All I can imagine is that the file is broken, and the other editors you tried
either truncate it, or are more forgiving and leave the invalid bytes as-is.
As @elextr explained, we can't really do that because we need UTF-8 encoding in
the buffer, so we need to be able to convert to and from it. With invalid
sequences, it wouldn't be possible to restore the original bytes.
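
As a rough illustration of why the round trip fails, here is a small Python sketch. It is not what the editor does internally. It only shows that a strict decoder rejects the sequence and that a lenient one (here Python's `errors="replace"`) changes the bytes. The helper names and the sample bytes are hypothetical.

```python
# Hypothetical sample: the words 0xD7B8 0xDECA, little-endian.
SAMPLE = b"\xb8\xd7\xca\xde"

def strict_decode_fails(raw: bytes) -> bool:
    """Return True if the data is rejected by a strict UTF-16LE decoder."""
    try:
        raw.decode("utf-16-le")
    except UnicodeDecodeError:
        return True
    return False

def lossy_round_trip(raw: bytes) -> bytes:
    """UTF-16LE -> UTF-8 -> UTF-16LE, replacing invalid input with U+FFFD."""
    text = raw.decode("utf-16-le", errors="replace")
    utf8 = text.encode("utf-8")  # the form an UTF-8 buffer would hold
    return utf8.decode("utf-8").encode("utf-16-le")

print(strict_decode_fails(SAMPLE))         # the strict conversion is refused
print(lossy_round_trip(SAMPLE) == SAMPLE)  # the lenient one does not round-trip
```

Once the lone surrogate has been replaced with U+FFFD, nothing in the UTF-8 text records what the original word was, so saving the file would silently alter it.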
I'm fairly curious as to what the editors you see it working with
actually do with those bytes, and whether they really don't break the file.
Also, there are fairly odd things even in the part that is fully valid UTF-16.
Is the file really supposed to contain things like `B風e-de-mer` on line 194,
`C岡r` on line 326, `d诡rtement` on line 453, or `Ŵolian` on line 1737
(penultimate line, and the last before the invalid sequence)?