Hi,

Eric Kow <ko...@darcs.net> writes:
> So in the OldFormat we seem to assume that Darcs.Patch.FileName uses Unicode
> filenames encoded in UTF-8.  Does this mean that in the NewFormat, we just
> treat filenames as just sequences of bytes?  If so that (very superficially
> and unthinkingly said) sounds like a step backwards.  I wonder what exactly
> the thinking behind this was...
Just to put this line of reasoning back on the right track: UNIX stores
a filepath as a sequence of bytes, which could be basically anything. We
never decode anything we read from the filesystem, either. So in
OldFormat, each byte of the filepath is treated as a codepoint < 256,
and the resulting sequence of codepoints is stored as UTF8.

With new format, the "store as UTF8" bit is skipped, and we just store
bytes.

It needs to be emphasised that the UTF8 step in OldFormat is completely
superfluous: the filepath *is never decoded*, so the codepoints are
completely bogus. Whether we encode them as UTF8 or as raw bytes doesn't
matter, other than that there is an encoding difference for those
filepath bytes in the range 128 to 255: with OldFormat, each takes 2
storage bytes, while with NewFormat each uses a single storage byte.

The net result is that in the common case (i.e. UTF8 filenames),
OldFormat is storing double-encoded UTF8.
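
To make the double encoding concrete, here is a rough Python sketch of
the two storage schemes as described above. The function names are
made up for illustration and are not taken from the darcs sources; the
sketch only assumes that "a codepoint < 256 per byte" is the same
mapping as a Latin-1 decode.

```python
# Hypothetical helpers illustrating the two schemes; not darcs code.

def old_format_store(path_bytes: bytes) -> bytes:
    # Treat each raw byte as a codepoint < 256 (same mapping as Latin-1),
    # then store that codepoint sequence as UTF-8.
    return path_bytes.decode("latin-1").encode("utf-8")

def old_format_load(stored: bytes) -> bytes:
    # Reverse of the above: UTF-8 decode, then map codepoints back to bytes.
    return stored.decode("utf-8").encode("latin-1")

def new_format_store(path_bytes: bytes) -> bytes:
    # Store the filepath bytes as they are.
    return path_bytes

# A filename that is already UTF-8 on disk: "é" is the two bytes C3 A9.
on_disk = "é.txt".encode("utf-8")

old = old_format_store(on_disk)
new = new_format_store(on_disk)

print(on_disk)  # b'\xc3\xa9.txt'
print(old)      # b'\xc3\x83\xc2\xa9.txt'
print(new)      # b'\xc3\xa9.txt'
```

Each of the two bytes >= 128 in the on-disk name becomes two bytes in
the OldFormat output, which is the double-encoded UTF8 mentioned above.
Both schemes round-trip to the same filepath bytes; only the stored
size differs.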

Hope this makes things clearer and shows why NewFormat is actually a
step forward.

Yours,
   Petr.
_______________________________________________
darcs-users mailing list
darcs-users@darcs.net
http://lists.osuosl.org/mailman/listinfo/darcs-users