On Tuesday, January 13, 2015 at 7:43:07 PM UTC-5, [email protected] wrote:
> Which can present problems if the UTF8String is displayed or otherwise
> used where valid UTF8 is required.
>

It will display as mojibake, but you will still be able to open the file. There doesn't seem to be much alternative, since you can't reliably detect an unknown encoding.

You can still do processing that treats the filename like a cookie, e.g. concatenation. You can still detect ASCII suffixes (e.g. ".txt") via endswith even if other code units are invalid, as long as it is an encoding where ASCII is left as-is; otherwise there is not much recourse anyway. But in general there is not much non-cookie processing that you can expect to do on strings in an unknown encoding.

(Apparently this is what Go does too?)

> Maybe readdir should in fact return a raw bytestring as advertised in the
> documentation
> http://docs.julialang.org/en/release-0.3/stdlib/io-network/?highlight=readdir#Base.readdir
>

A "ByteString" in Julia is just an alias for Union(ASCIIString,UTF8String), i.e. a string where the code units are bytes, so it is doing what it is documented as doing. (And as I mentioned, in the future ASCIIString will probably disappear and it will just be UTF8String.) There is no other "raw bytestring" type.

It sounds like you don't want a "string" at all, you just want an array of bytes, i.e. a Vector{Uint8}. But I tend to think it is better and more convenient to work with the data as a UTF8String and have it display sensibly 99.99...% of the time, and work with string operations like concatenation, even if you occasionally display mojibake in rare cases.
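
To make the "cookie" processing above concrete, here is a rough sketch. It is written against current Julia (1.x), where `String` has taken the place of `UTF8String` and `UInt8` is spelled with a capital I, so it is not the 0.3 API discussed in this thread. The filename bytes are a made-up example ("résumé.txt" encoded as Latin-1), not something returned by readdir.

```julia
# Hypothetical filename bytes: "résumé.txt" in Latin-1.
# The two 0xe9 bytes are not valid UTF-8 in this position.
raw = UInt8[0x72, 0xe9, 0x73, 0x75, 0x6d, 0xe9, 0x2e, 0x74, 0x78, 0x74]

# String() takes ownership of the vector it is given, so pass a copy.
name = String(copy(raw))

# The data is not valid UTF-8, but it is still an ordinary string value.
valid = isvalid(name)              # false

# ASCII suffix detection still works, because Latin-1 leaves ASCII bytes as-is.
is_txt = endswith(name, ".txt")    # true

# Cookie-style processing: concatenation copies the original bytes untouched,
# so the result can still be handed back to the filesystem.
backup = name * ".bak"
roundtrip = collect(codeunits(backup))[1:length(raw)] == raw   # true
```

Printing `name` would show mojibake (or replacement characters) for the two invalid bytes, but none of the operations above needs to know the real encoding.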
