Hi,

Eric Kow <ko...@darcs.net> writes:
> So in the OldFormat we seem to assume that Darcs.Patch.FileName uses Unicode
> filenames encoded in UTF-8. Does this mean that in the NewFormat, we just
> treat filenames as just sequences of bytes? If so that (very superficially
> and unthinkingly said) sounds like a step backwards. I wonder what exactly
> the thinking behind this was...

just to put this line of reasoning back on the right track...

The story is that UNIX stores a filepath as a sequence of bytes. This could be basically anything. We never decode anything we read from the filesystem, either. So in OldFormat, what happens is that the sequence of bytes (the filepath) is taken, each byte is treated as a codepoint < 256, and the resulting sequence of codepoints is stored as UTF8.
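To make the OldFormat step concrete, here is a minimal sketch in Python (not Darcs code; Darcs itself is Haskell, and the function names old_format_store and old_format_load are made up for illustration). It assumes that "a sequence of codepoints < 256" means one codepoint per byte, which is exactly what a Latin-1 decode does.

```python
def old_format_store(path_bytes: bytes) -> bytes:
    """Hypothetical model of the OldFormat step: treat every byte of the
    filepath as a codepoint < 256, then encode those codepoints as UTF-8."""
    # Latin-1 maps byte value N to codepoint U+00NN, so this never fails.
    return path_bytes.decode("latin-1").encode("utf-8")


def old_format_load(stored: bytes) -> bytes:
    """Inverse of old_format_store: recover the original filepath bytes."""
    return stored.decode("utf-8").encode("latin-1")


# A filename that is already UTF-8 on disk: "é.txt" is the bytes C3 A9 2E 74 78 74.
raw = "\u00e9.txt".encode("utf-8")
stored = old_format_store(raw)

print(raw)     # b'\xc3\xa9.txt'
print(stored)  # b'\xc3\x83\xc2\xa9.txt'
```

The round trip is lossless (old_format_load(stored) gives back raw), but every byte above 127 has been expanded into two bytes, and the stored form no longer decodes to the original name.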
With NewFormat, the "store as UTF8" step is skipped, and we just store the bytes.

It needs to be emphasised that the UTF8 step in OldFormat is completely superfluous: the filepath *is never decoded*, so the codepoints are completely bogus. Whether we encode them as UTF8 or as raw bytes doesn't matter, other than that there is an encoding difference for those filepath bytes with values from 128 to 255. With OldFormat, each of them takes 2 storage bytes, while with NewFormat each uses a single storage byte. The net result is that in the common case (i.e. UTF8 filenames), OldFormat is storing double-encoded UTF8.

Hope this makes things clearer, and it shows why NewFormat is actually a step forward.

Yours,
   Petr.