Re: [Github-comments] [geany/geany] fails to open Microsoft UTF-16LE file (MSO Word CUSTOM.DIC dictionary file) (#1238)

Colomban Wendling Mon, 19 Sep 2016 03:12:19 -0700

```
$ iconv -f UTF-16LE -t UTF-8 < CUSTOM-utf16le-2016.dic > CUSTOM-utf16le-2016.dic_utf8
iconv: illegal input sequence at position 34076
```


Apparently that file has the bytes `\xca \xde` near the end, and iconv doesn't 
accept that sequence (so I'd guess it really is invalid).
After reading up on how UTF-16 works, it does indeed seem invalid: the input 
contains the sequence `0xD7B8 0xDECA`.  `0xD7B8` is [HANGUL JUNGSEONG 
YU-O](http://www.charbase.com/d7b8-unicode-hangul-jungseong-yu-o), which is 
encoded as a single word (it is below `U+FFFF`, and not in the range `U+D800 - 
U+DFFF`).  The next word is `0xDECA`, which, being in the range `0xDC00 - 
0xDFFF`, should be the second word of a two-word pair.  It is not (the previous 
word is not in the `0xD800 - 0xDBFF` range), so it is invalid.
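
To make the pairing rule concrete, here is a minimal Python sketch that scans UTF-16LE bytes for unpaired surrogate words.  The function name `find_unpaired_surrogates` and the four-byte sample are made up for illustration; this is not what iconv or Geany do internally, only the same check written out by hand.

```python
def find_unpaired_surrogates(data: bytes):
    """Return (byte_offset, word) for each unpaired surrogate in UTF-16LE data."""
    problems = []
    n = len(data) - len(data) % 2  # ignore a trailing odd byte, if any
    i = 0
    while i < n:
        word = int.from_bytes(data[i:i + 2], "little")
        if 0xD800 <= word <= 0xDBFF:
            # First word of a pair: the next word must be in 0xDC00 - 0xDFFF.
            if i + 2 < n:
                nxt = int.from_bytes(data[i + 2:i + 4], "little")
                if 0xDC00 <= nxt <= 0xDFFF:
                    i += 4  # valid two-word pair, skip both words
                    continue
            problems.append((i, word))
        elif 0xDC00 <= word <= 0xDFFF:
            # Second word of a pair with no first word before it.
            problems.append((i, word))
        i += 2
    return problems


# Hypothetical sample: 0xD7B8 followed by 0xDECA, stored little-endian.
sample = bytes.fromhex("b8d7cade")
for offset, word in find_unpaired_surrogates(sample):
    print(f"unpaired surrogate 0x{word:04X} at byte offset {offset}")
```

Run against the whole file's bytes, a check like this should flag the same spot iconv complains about, assuming the file has no other damage.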

I tested other editors, GEdit and `vim`, and both fail to open the file 
properly.  GEdit opens it almost correctly, but only up to the invalid 
sequence, warning that the file might be truncated and that saving it could 
result in data loss.  Vim shows plain garbage.

All I can imagine is that the file is broken, and that the other editors you 
tried either truncate it, or are more forgiving and leave the invalid bytes 
as-is.  As @elextr explained, we can't really do that because we need UTF-8 
encoding in the buffer, so we need to be able to convert to and from it.  With 
invalid sequences, it wouldn't be possible to restore the original bytes.
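
The round-trip problem can be illustrated with Python's built-in codecs.  This is only a sketch of the general issue, not Geany's actual conversion code; the helper names and the sample bytes are hypothetical.

```python
# Hypothetical sample: 0xD7B8 followed by the unpaired word 0xDECA (UTF-16LE).
raw = bytes.fromhex("b8d7cade")


def strict_decode_fails(data: bytes) -> bool:
    """True if a strict UTF-16LE decode rejects the data."""
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return True
    return False


def lossy_round_trip(data: bytes) -> bytes:
    """Decode leniently, pass through UTF-8, and convert back to UTF-16LE."""
    # errors="replace" substitutes U+FFFD for anything that cannot be decoded.
    text = data.decode("utf-16-le", errors="replace")
    utf8_buffer = text.encode("utf-8")  # what a UTF-8 editor buffer would hold
    return utf8_buffer.decode("utf-8").encode("utf-16-le")


print("strict decode fails:", strict_decode_fails(raw))
print("round trip preserves bytes:", lossy_round_trip(raw) == raw)
```

A lenient decoder can get the text into a UTF-8 buffer, but the invalid word is replaced on the way in, so saving would write different bytes than were read.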

I'm actually fairly curious as to what the editors you see it working with 
actually do with those bytes, and whether they really don't break the file.

Also, there are fairly odd things even in the part that is fully valid UTF-16.  
Is the file really supposed to contain things like `B風e-de-mer` on line 194, 
`C岡r` on line 326, `d诡rtement` on line 453, or `Ŵolian` on line 1737 
(penultimate line, and the last before the invalid sequence)?

-- 
Reply to this email directly or view it on GitHub:
https://github.com/geany/geany/issues/1238#issuecomment-247955937
