On Thursday, January 15, 2015 at 3:02:13 AM UTC+10, Steven G. Johnson wrote:
>
> On Tuesday, January 13, 2015 at 10:38:23 PM UTC-5, [email protected] wrote:
>>
>> Probably right if the mutations for adding extensions etc are not
>> conveniently available with Vector{uint8}.
>>
>
> It would certainly be possible to define these operations, e.g.
> concatenation of a string with a bytevector. But even then I think a
> bytevector would be the wrong choice. When I look at a filename, I don't
> want to see UInt8[0x66,0x6f,0x6f,0x2e,0x74,0x78,0x74], I want to see
> "foo.txt".
>
Good point, I would too.

> And by returning a (potentially invalid) UTF8String, that's what I get in
> the *vast* majority of cases—non-UTF8 filenames seem to be pretty rare
> nowadays even on Unix systems (e.g. many GNU/Linux systems have defaulted
> to displaying filenames as UTF-8 for a decade now).

I see it mostly in non-English locales on Windows, or where Windows disks
are mounted on Linux, and it is still a fairly common problem. Those of us
in English locales shouldn't extrapolate our easy ride with encodings to
the rest of the world :)

> Even for a non-UTF8 filename where I get mojibake, in most cases it will
> be in some other 1-byte superset of ASCII, so the displayed results will
> still be somewhat useful: I'd much rather see "Foo££££????.txt" than
> a list of byte values.
>
> I guess a third alternative would be to define an UnknownEncodingString
> type that stores an array of bytes and displays by default as UTF-8 (or
> even tries to guess the encoding) and supports concatenation and a few
> other carefully chosen operations, but not iteration over codepoints and
> other things that can't be implemented without knowing the encoding. The
> idea being to prevent programmers from trying to perform operations on
> filenames that may fail on strings with unknown encodings. But this seems
> like it would be a lot of hassle for little benefit these days.

To me this would be preferable; that way, functions that treat the
UnknownEncodingString as UTF-8 know they have to be robust when
encountering invalid UTF-8. Otherwise it might be necessary for all UTF-8
handling functions to check for invalid sequences so they don't cause
problems, which has a possible performance penalty.

I have done work on an application that tries to guess encodings, and
other than ASCII or valid UTF-8, it's not often right. :) It very much
depends on the order of trying alternatives, and it is performance
intensive, though most filenames are so short this wouldn't matter much.
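For concreteness, here's a minimal sketch (in current Julia syntax, not the 0.3-era syntax of this thread) of what such a type might look like. Everything here is hypothetical — `UnknownEncodingString` and its operations are not part of Base; the point is just that concatenation is a pure byte operation and therefore safe, while codepoint iteration is deliberately left undefined:

```julia
# Hypothetical sketch: a string type that stores raw bytes of unknown
# encoding, displays them as if they were UTF-8, and supports
# concatenation -- but defines no codepoint iteration, since that
# can't be implemented without knowing the encoding.
struct UnknownEncodingString
    bytes::Vector{UInt8}
end

# Display by decoding the bytes as UTF-8; invalid sequences show up
# as replacement characters rather than raising an error.
Base.show(io::IO, s::UnknownEncodingString) =
    print(io, '"', String(copy(s.bytes)), '"')

# Concatenation operates purely on bytes, so it is correct regardless
# of the (unknown) encoding.
Base.:*(a::UnknownEncodingString, b::UnknownEncodingString) =
    UnknownEncodingString(vcat(a.bytes, b.bytes))
Base.:*(a::UnknownEncodingString, b::AbstractString) =
    UnknownEncodingString(vcat(a.bytes, Vector{UInt8}(codeunits(b))))

name = UnknownEncodingString([0x66, 0x6f, 0x6f])   # the bytes of "foo"
path = name * ".txt"                               # appending an extension
```

Appending an extension — the mutation that started this thread — then works without ever interpreting the bytes, while anything that would require decoding simply isn't available on the type.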
