On Tuesday, January 13, 2015 at 7:43:07 PM UTC-5, [email protected] wrote:

> Which can present problems if the UTF8String is displayed or otherwise 
> used where valid UTF8 is required. 
>

It will display as mojibake, but you will still be able to open the file.
There doesn't seem to be much alternative, since you can't reliably detect
an unknown encoding. You can still do processing that treats the filename
as an opaque cookie, e.g. concatenation. You can also still detect ASCII
suffixes (e.g. ".txt") via endswith even if other code units are invalid,
as long as the encoding leaves ASCII bytes as-is; otherwise there is not
much recourse anyway. In general, though, there is not much non-cookie
processing that you can expect to do on strings in an unknown encoding.
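
To make the "cookie" point concrete, here is a rough sketch in Python rather than Julia, using Python's bytes type as a stand-in for a string whose code units are bytes. The filename and the helper names (is_valid_utf8, has_ascii_suffix, join_path) are made up for illustration, and I'm assuming a Latin-1 name that happens not to be valid UTF-8:

```python
# Hypothetical filename in an "unknown" encoding. It is actually Latin-1:
# the byte 0xE9 is not a valid UTF-8 sequence when followed by ".".
raw_name = b"caf\xe9.txt"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def has_ascii_suffix(name: bytes, suffix: bytes) -> bool:
    """Byte-wise suffix check; works as long as the encoding leaves ASCII as-is."""
    return name.endswith(suffix)


def join_path(directory: bytes, name: bytes) -> bytes:
    """Cookie-style processing: concatenate without interpreting the bytes."""
    return directory + b"/" + name


full_path = join_path(b"/tmp", raw_name)

# For display only: invalid bytes become U+FFFD, i.e. the name shows as
# mojibake, but raw_name and full_path are untouched and still usable.
display_name = raw_name.decode("utf-8", errors="replace")
```

The suffix check and the concatenation never look at the invalid byte, so they keep working; only the display step has to deal with the bad data.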

(Apparently this is what Go does too?)
 

> Maybe readdir should in fact return a raw bytestring as advertised in the 
> documentation 
> http://docs.julialang.org/en/release-0.3/stdlib/io-network/?highlight=readdir#Base.readdir
>

A "ByteString" in Julia is just an alias for Union(ASCIIString,UTF8String),
i.e. a string whose code units are bytes, so it is doing what it is
documented as doing. (And as I mentioned, in the future ASCIIString will
probably disappear and it will just be UTF8String.) There is no other "raw
bytestring" type.

It sounds like you don't want a "string" at all; you just want an array of
bytes, i.e. a Vector{Uint8}. But I tend to think it is better and more
convenient to work with the data as a UTF8String, so that it displays
sensibly 99.99...% of the time and works with string operations like
concatenation, even if you occasionally display mojibake in rare cases.
