On Thursday, January 15, 2015 at 3:02:13 AM UTC+10, Steven G. Johnson wrote:
>
>
>
> On Tuesday, January 13, 2015 at 10:38:23 PM UTC-5, [email protected] wrote:
>>
>> Probably right if the mutations for adding extensions etc. are not 
>> conveniently available with Vector{UInt8}.
>>
>
> It would certainly be possible to define these operations, e.g. 
> concatenation of a string with a bytevector.   But even then I think a 
> bytevector would be the wrong choice.   When I look at a filename, I don't 
> want to see UInt8[0x66,0x6f,0x6f,0x2e,0x74,0x78,0x74], I want to see 
> "foo.txt". 
>

Good point, I would too.
 

>  And by returning a (potentially invalid) UTF8String, that's what I get in 
> the *vast* majority of cases—non-UTF8 filenames seem to be pretty rare 
> nowadays even on Unix systems (e.g. many GNU/Linux systems have defaulted 
> to displaying filenames as UTF-8 for a decade now).  
>

I see it mostly in non-English locales on Windows, or where Windows disks 
are mounted on Linux, and it is still a fairly common problem.  Those of us 
in English locales shouldn't extrapolate our easy ride with encodings to 
the rest of the world :)

 

> Even for a non-UTF8 filename where I get mojibake, in most cases it will 
> be in some other 1-byte superset of ASCII, so the displayed results will 
> still be somewhat useful: I'd much rather see "Foo££££????.txt" than 
> a list of byte values.
>
> I guess a third alternative would be to define an UnknownEncodingString 
> type that stores an array of bytes and displays by default as UTF-8 (or 
> even tries to guess the encoding) and supports concatenation and a few 
> other carefully chosen operations, but not iteration over codepoints and 
> other things that can't be implemented without knowing the encoding.  The 
> idea being to prevent programmers from trying to perform operations on 
> filenames that may fail on strings with unknown encodings.   But this seems 
> like it would be a lot of hassle for little benefit these days.
>

To me this would be preferable: that way, functions that treat an 
UnknownEncodingString as UTF-8 know they have to be robust when 
encountering invalid UTF-8.  Otherwise it might be necessary for every 
UTF-8 handling function to check for invalid sequences so they don't cause 
problems, which carries a possible performance penalty.
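
To make the robustness point concrete, here is a minimal sketch of the kind of guard such functions could apply before doing codepoint-level work (using `isvalid(String, bytes)` from current Julia Base; the thread predates that API, so take it as an assumption about available tooling):

```julia
# A filename's bytes may or may not be valid UTF-8.
bytes = UInt8[0x66, 0x6f, 0x6f, 0x2e, 0x74, 0x78, 0x74]       # "foo.txt"
@show isvalid(String, bytes)    # true: safe to treat as UTF-8

# A stray 0xff byte: 0xff can never appear anywhere in valid UTF-8.
bad = UInt8[0x66, 0x6f, 0x6f, 0xff, 0x2e, 0x74, 0x78, 0x74]
@show isvalid(String, bad)      # false: codepoint iteration would need care
```

Checking validity once up front, rather than in every string operation, is one way to pay the cost only when a filename actually needs the slow path.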

I have done work on an application that tries to guess encodings, and other 
than ASCII or valid UTF-8, it's not often right.  :)  It depends very much 
on the order in which the alternatives are tried, and it is computationally 
expensive, though most filenames are so short that this wouldn't matter much.
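
The "order of trying alternatives" point can be sketched with the simplest possible two-step guesser: try UTF-8 first, and fall back to Latin-1, which can decode any byte sequence (`decode_filename` is an illustrative name, not an existing API):

```julia
# Hypothetical sketch: decode filename bytes by trying UTF-8 first,
# then falling back to Latin-1.
function decode_filename(bytes::Vector{UInt8})
    if isvalid(String, bytes)
        return String(copy(bytes))           # valid UTF-8, use it directly
    end
    # Latin-1 fallback: byte 0xNN maps directly to codepoint U+00NN,
    # so this step cannot fail -- which is why it must come last.
    return String([Char(b) for b in bytes])
end
```

Putting Latin-1 (or any always-succeeding decoder) earlier in the order would mean UTF-8 filenames never get decoded correctly, which is exactly the ordering sensitivity described above.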
