Trent Buck <[EMAIL PROTECTED]> added the comment:

On Sat, Oct 11, 2008 at 09:04:14PM -0000, Reinier Lamers wrote:
> I believe I read in a mailing list thread that darcs can't use a
> consistent encoding for metadata, because it uses the metadata for
> hashing precisely (bit-by-bit) as it got it from the operating
> system.

AIUI darcs currently treats everything as byte vectors.  This is fine
as long as everyone uses the same character set and encoding.
Unfortunately, while that might be true for small groups, it's not
true for large, international projects like Darcs itself.

Any existing patches recorded by darcs have lost essential
information: the encoding of the metadata.  It's impossible to get
this back reliably.  So we have two separate issues:

  - We need a workaround in order to work with existing
    multi-encoding repositories, including Darcs' own repo.  As you
    say below, probably the best we can do is just throw away any
    non-ASCII characters :-(

  - We need to prevent this from happening in future, by either

    0) forcing everyone to use UTF-8.  I think we can just dismiss
       this as impossible, if only because of Japan.

    1) recording the metadata coding as part of the metadata (as
       done by MIME for email); or

    2) standardizing on a single coding for internal use (that is,
       within the actual patches in _darcs), and converting all user
       input to that coding.

       The encoding used internally isn't particularly important,
       but obvious candidates are UTF-8, UTF-16, Unicode codepoint
       sequences and ISO 10646.  Since UTF-8 has useful properties
       for

In both (1) and (2) we need to convert output to the user's coding,
with some kind of sensible behaviour when that's not possible
(e.g. the user is using ISO 8859-1 and the patch author's name
contains Greek characters).  The iconv(1) tool might be useful as an
example of handling such lossy recoding.

> Perhaps we can put a declaration in the XML that the encoding is
> iso-8859-1 (aka latin1)?  There is no such thing as invalid
> iso-8859-1, and most data in ASCII-based encoding will look
> reasonable in iso-8859-1.

While this might work around the immediate issue, it is not a
long-term solution.  If you forcibly treat the entire byte vector as
some ASCII-compatible eight-bit encoding (e.g. ISO 8859-1), you will
silently(!)
get gibberish for

  - any non-ASCII character in all other ASCII-compatible codings,
    including UTF-8 and the other ISO 8859 encodings; and

  - *ALL* characters in ASCII-incompatible codings, including the
    popular UTF-16 and JIS.

----------------------------------------------------------------------

As a real-world case study, I compared the Darcs repo's metadata with
and without invalid UTF-8 characters:

    darcs changes --xml >/tmp/x
    darcs changes --xml | iconv -c -f utf-8 -t utf-8 >/tmp/y
    diff -u /tmp/[xy]

It appears that 'Daniel Bünzli' is using Latin-1 and every other
contributor is using UTF-8, or is using an encoding that happens to
silently convert to gibberish when treated as UTF-8.

If we treat everything as pure ASCII, we can see that there are only
two more cases -- one use of UTF-8 smart quotes, and one UTF-8 ú.

    darcs changes --xml >/tmp/x
    darcs changes --xml | iconv -c -f ascii -t utf-8 >/tmp/y
    diff -u /tmp/[xy]

__________________________________
Darcs bug tracker <[EMAIL PROTECTED]>
<http://bugs.darcs.net/issue1143>
__________________________________
_______________________________________________
darcs-users mailing list
darcs-users@darcs.net
http://lists.osuosl.org/mailman/listinfo/darcs-users
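
The failure modes discussed in the message above can be reproduced in
a few lines.  This is only a rough sketch in Python, not anything
darcs itself does: the sample strings and the helper names
(drop_invalid_utf8, ascii_only, encode_for_user) are made up for
illustration, and errors="ignore" is assumed to be a close enough
stand-in for `iconv -c`.

```python
# Hypothetical sample data: the author name from the case study,
# encoded the two ways that the message says occur in the repo.
name = "Daniel B\u00fcnzli"          # u-umlaut written as an escape
as_utf8 = name.encode("utf-8")       # b'Daniel B\xc3\xbcnzli'
as_latin1 = name.encode("latin-1")   # b'Daniel B\xfcnzli'

# Forcing ISO 8859-1 never raises, because every byte value maps to
# some character.  A wrong guess therefore goes unnoticed.
mojibake = as_utf8.decode("latin-1")

# An ASCII-incompatible coding fares worse: UTF-16 bytes read as
# Latin-1 come back with NUL characters interleaved throughout.
utf16_garbage = name.encode("utf-16-le").decode("latin-1")


def drop_invalid_utf8(raw: bytes) -> str:
    """Assumed analogue of `iconv -c -f utf-8 -t utf-8`:
    decode as UTF-8 and discard any bytes that are not valid."""
    return raw.decode("utf-8", errors="ignore")


def ascii_only(raw: bytes) -> str:
    """Assumed analogue of `iconv -c -f ascii -t utf-8`:
    throw away every non-ASCII byte."""
    return raw.decode("ascii", errors="ignore")


def encode_for_user(text: str, user_encoding: str) -> bytes:
    """Lossy recoding for display: characters the user's coding
    cannot represent are replaced with '?' instead of failing."""
    return text.encode(user_encoding, errors="replace")


if __name__ == "__main__":
    print(repr(mojibake))
    print(repr(drop_invalid_utf8(as_latin1)))
    print(repr(ascii_only(as_utf8)))
    print(encode_for_user("\u039d\u03af\u03ba\u03bf\u03c2", "latin-1"))
```

Note that drop_invalid_utf8 leaves correctly encoded UTF-8 untouched
and only loses the Latin-1 byte, which is why the diff in the case
study singles out one contributor.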