Re: [darcs-users] darcs and non-ASCII characters

Wolfgang Jeltsch Sat, 22 Oct 2005 10:23:23 -0700

On Saturday, 22 October 2005 13:46, Tommy Pettersson wrote:
> On Fri, Oct 21, 2005 at 02:58:10PM +0200, Wolfgang Jeltsch wrote:
> > how does darcs handle non-ASCII characters in filenames, patch names and
> > long comments?  What happens, for example, if someone who uses a
> > different character encoding than me fetches a copy of my repository via
> > darcs get? Will non-ASCII characters be properly interpreted?
>
> When outputting to the terminal darcs is supposed to escape
> everything that is not printable ASCII by default.


Yes, it does so.
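
To illustrate the kind of escaping meant here, the following is a minimal sketch in Python. The function name and the `\xNN` escape notation are hypothetical choices made for this example; they are not taken from darcs and may differ from what darcs actually prints.

```python
def escape_non_printable_ascii(raw: bytes) -> str:
    """Render bytes for a terminal, escaping every byte outside printable ASCII.

    The \\xNN notation is an assumption for illustration only.
    """
    out = []
    for b in raw:
        if 0x20 <= b <= 0x7E:
            # Printable ASCII (space through tilde) passes through unchanged.
            out.append(chr(b))
        else:
            # Control bytes and all 8-bit bytes are shown as hex escapes.
            out.append(f"\\x{b:02x}")
    return "".join(out)


# The same filename produces different escapes depending on how it was encoded.
print(escape_non_printable_ascii("Bär.txt".encode("utf-8")))    # B\xc3\xa4r.txt
print(escape_non_printable_ascii("Bär.txt".encode("latin-1")))  # B\xe4r.txt
```

Escaping like this is safe on any terminal because the output is pure ASCII, but it makes non-ASCII names hard to read.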

> There are some environment variables described in section 'Character
> escaping and non-ASCII character encodings' in the manual
> <http://www.darcs.net/manual/> to allow 8-bit chars and UTF8.

I use a UTF-8 locale and tried DARCS_DONT_ESCAPE_8BIT=1.  Alas, darcs still 
prints non-ASCII characters in escaped form, but I think this is because I use 
darcs 1.0.2.

> Darcs does not interpret the encoding of patch names or patch comments, so
> the output can probably look garbage if looked at with a wrong locale.

That's probably bad.  Patch names and patch comments are character sequences.  
Therefore they probably shouldn't be handled as mere byte sequences; the 
encoding should be taken into account.  For example, if I record a patch with 
a name containing non-ASCII characters using a Latin-1 locale, the patch name 
should read the same for a user with a UTF-8 locale.
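
A small Python sketch of the problem, assuming a made-up patch name and a hypothetical `transcode` helper (neither comes from darcs): the same characters become different bytes under Latin-1 and UTF-8, so bytes stored verbatim are misread under the other locale unless the stored encoding is known.

```python
patch_name = "Änderung"  # hypothetical patch name containing a non-ASCII character

latin1_bytes = patch_name.encode("latin-1")  # what a Latin-1 user would store
utf8_bytes = patch_name.encode("utf-8")      # what a UTF-8 user would store

# Same character sequence, different byte sequences.
print(latin1_bytes)  # b'\xc4nderung'
print(utf8_bytes)    # b'\xc3\x84nderung'

# A UTF-8 user interpreting the Latin-1 bytes as-is cannot recover the name.
misread = latin1_bytes.decode("utf-8", errors="replace")
print(misread == patch_name)  # False


def transcode(raw: bytes, stored_encoding: str, target_encoding: str) -> bytes:
    """Convert via the character sequence; needs the stored encoding to be known."""
    return raw.decode(stored_encoding).encode(target_encoding)


# With the encoding taken into account, both users see the same characters.
print(transcode(latin1_bytes, "latin-1", "utf-8") == utf8_bytes)  # True
```

The catch is that a conversion like this only works if the encoding used when the patch was recorded is stored or otherwise known.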

> The same should be true for file names, but I believe there is some
> interference with a Haskell module that uses UTF8, and I've sometimes seen
> UTF8 in output of filenames even though I use Latin1, which should be
> regarded as bugs.

And I see "double-UTF-8" in output.  I use a UTF-8 locale.  darcs takes my 
UTF-8 encoded filenames, interprets them as Latin-1 encoded and does a 
Latin-1-to-UTF-8 conversion before outputting them on the terminal.  So a 
character like ä is then represented by 4 bytes.  Well, this might be just a 
presentational problem.
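
The double encoding described above can be reproduced in a few lines of Python. This is only a sketch of the effect, not of darcs's actual code path; the variable names are made up for the example.

```python
original = "ä"

# Correct UTF-8 encoding: two bytes.
utf8_once = original.encode("utf-8")
print(utf8_once)  # b'\xc3\xa4'

# The bug: the UTF-8 bytes are interpreted as Latin-1, yielding two characters...
misread = utf8_once.decode("latin-1")

# ...and those two characters are then converted to UTF-8 again: four bytes.
utf8_twice = misread.encode("utf-8")
print(utf8_twice)       # b'\xc3\x83\xc2\xa4'
print(len(utf8_twice))  # 4

# As long as no bytes were lost, the process can be reversed.
repaired = utf8_twice.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

Since the round trip is reversible, this supports the guess that the damage is limited to presentation, provided the stored bytes themselves are untouched.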

I suppose that darcs stores the byte sequence which makes up a filename 
verbatim, not taking any encodings into account.  In other words, for darcs, 
filenames seem to be byte sequences instead of character sequences.  The 
question is whether this is good behavior.  At least it avoids problems if, 
for example, a Makefile refers to a file.  If filenames were treated as 
character sequences, the underlying byte sequence would change whenever a 
different encoding is used.  But the byte sequence in the Makefile wouldn't 
change, so the Makefile would no longer work correctly.
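
The Makefile argument can be sketched in Python as follows. The filename and Makefile contents are invented for the example, and the two "checkout" variables merely model the two possible behaviors; nothing here reflects how darcs is actually implemented.

```python
filename = "Bär.c"  # hypothetical source file with a non-ASCII name

# The author works in a UTF-8 locale, so both the name on disk and the
# reference inside the Makefile consist of the same UTF-8 bytes.
name_utf8 = filename.encode("utf-8")
makefile = b"prog: " + name_utf8 + b"\n\tcc -o prog " + name_utf8 + b"\n"

# Model 1: filenames are byte sequences, checked out verbatim everywhere.
checked_out_verbatim = name_utf8

# Model 2: filenames are character sequences, re-encoded for a Latin-1 user.
# File contents (the Makefile) are still copied byte for byte.
checked_out_latin1 = filename.encode("latin-1")

print(checked_out_verbatim in makefile)  # True:  the Makefile still finds the file
print(checked_out_latin1 in makefile)    # False: the reference no longer matches
```

Under the byte-sequence model the build keeps working but the name may display incorrectly; under the character-sequence model the name displays correctly but references inside file contents can break.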

Is it really true that filenames are just byte sequences for darcs and no 
character encodings are taken into account when storing and retrieving 
filenames?  Or are filenames treated as character sequences?  Or are they 
treated non-uniformly, which would mean that darcs is buggy at this point and 
that one should avoid filenames with non-ASCII characters?

Best wishes,
Wolfgang

