On Tuesday, 6 September 2005, 01:17, Yves Dorfsman wrote:
> Hi,
> Has anybody run into problems with GNU CVS and Unicode?
> I have made a few tests (with UTF-8), and so far it has worked, but
> some of my users say they ran into problems with some files. I can see
> how some legal UTF-8 characters could be mistaken for control
> codes/binary.
> Does anybody have extensive experience with this?
Yes.
> Encoding problems are on the operating system / editor side.
> CVS does not care about the encoding at all.
Except that the diffs between files are computed byte-wise, not
character-wise. This could lead to problems when multi-byte characters
occur, though such problems are rare at best. My guess is that merging
will succeed most of the time, but that the actual diff might contain
invalid byte sequences (although, considering CVS's line-by-line diff,
that may never occur).
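In fact, UTF-8's design supports that hunch: every byte of a multi-byte
sequence is >= 0x80, so a byte-oriented, line-based diff can never
mistake part of a character for a line terminator. A small Python
sketch (just an illustration of the byte layout, nothing CVS-specific):

    # All bytes of a multi-byte UTF-8 sequence are >= 0x80, so an LF
    # byte (0x0A) in UTF-8 data is always a real line terminator.
    text = "naïve café\n日本語\n"
    data = text.encode("utf-8")

    # 0x0A occurs in the bytes exactly as often as "\n" in the text:
    assert data.count(0x0A) == text.count("\n")

    # Splitting byte-wise on LF therefore never cuts a character apart:
    for line in data.split(b"\n"):
        line.decode("utf-8")  # would raise UnicodeDecodeError otherwise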
> Somebody wrote:
> > In CVS a Unicode file has to be a binary file (-kb), which prevents
> > merging, diffs, etc. If you do not define it as -kb, then eventually
> > the file will be corrupted.
> This is completely wrong and lacks any technical substance.
Both pretty firm statements, and both partially right, partially wrong.
Nobody has yet found a multi-byte sequence that actually breaks merging
or diffing, but since CVS was not designed for multi-byte characters,
there may very well be one (+).
> The issue is not CVS; the issue is telling your editor about the
> correct file encoding. It's the text editor and how it interprets
> byte sequences.
I agree partially. Sure, the editor must support multi-byte characters.
But I think it is a mistake for CVS to start supporting Unicode without
supporting the environment's native encoding. "We only support Unicode
if you manually save all your text files as UTF-8" does not constitute
full Unicode support, I think.
> The Unicode thingy in CVSNT is just a hack to work around operating
> system issues regarding MS Windows.
I have no idea what the Unicode support in CVSNT does. I always thought
it was there to prevent invalid CR/LF conversions inside multi-byte
characters. For example, code point 522 (U+020A) is encoded in UTF-16BE
as the bytes 0x02 0x0A, and that second byte looks like a line feed to
a byte-oriented tool. But I never used CVSNT and don't know whether it
even supports UTF-16 (both BE and LE) or only UTF-8. The conversion
problem cannot even occur in UTF-8, because all bytes of a multi-byte
sequence are >= 0x80.
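The hazard is easy to show in a few lines of Python (purely
illustrative; I am not claiming this is what CVSNT actually guards
against):

    # Code point 522 (U+020A) contains an LF byte in UTF-16BE, so a
    # naive byte-wise LF -> CRLF conversion destroys the character.
    cp = "\u020a"
    utf16 = cp.encode("utf-16-be")   # b'\x02\x0a'
    utf8  = cp.encode("utf-8")       # b'\xc8\x8a', every byte >= 0x80

    assert utf16.replace(b"\n", b"\r\n") == b"\x02\r\n"  # corrupted
    assert utf8.replace(b"\n", b"\r\n") == utf8          # unchanged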
> I have been using UTF-8 in tons of files for years now (all my Java
> sources are UTF-8 encoded, as well as most of my C++ sources and, of
> course, all my XML files) without any problems.
I do the same and have not encountered problems so far. But that does
not mean there are no issues. What you and I are doing is a workaround
to use Unicode with CVS.
> UTF-16, in fact, can be problematic. Normal keyword substitution is
> likely to fail, at least with some older versions of CVS. I don't know
> whether newer CVS uses wchar instead of char for keyword substitution.
> UTF-16 isn't in widespread use, so I haven't cared about that yet.
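That failure mode is easy to demonstrate: a byte-oriented scanner looks
for the ASCII bytes of "$Id$", and in UTF-16 those bytes are interleaved
with NULs. A quick Python illustration (not actual CVS code):

    # A byte-wise search for the keyword cannot match UTF-16 text,
    # because a NUL byte sits between every pair of ASCII bytes.
    keyword = b"$Id$"
    line = "$Id$ example\n"

    assert keyword in line.encode("utf-8")          # found
    assert keyword not in line.encode("utf-16-le")  # b'$\x00I\x00d\x00$\x00'...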
Maybe it would be worth investigating having the client handle all
character-set conversions and always storing repository files as UTF-8.
As far as you and I can attest, there are (so far) no issues with
handling UTF-8 files, so UTF-16 might then no longer be a problem.
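For what it's worth, such a client-side filter might look roughly like
this (a hypothetical Python sketch; the function names are made up and
nothing like this exists in CVS):

    import locale

    def to_repository(data, local_encoding=None):
        # Hypothetical commit-side filter: local encoding -> UTF-8.
        enc = local_encoding or locale.getpreferredencoding(False)
        return data.decode(enc).encode("utf-8")

    def from_repository(data, local_encoding=None):
        # Hypothetical checkout-side filter: UTF-8 -> local encoding.
        enc = local_encoding or locale.getpreferredencoding(False)
        return data.decode("utf-8").encode(enc)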
But then again, if the character-set support is already in the client,
how much effort would it take to move it to the server too? But I'm not
developing CVS, so I will kindly stop talking now ;)
Arno
_______________________________________________
Info-cvs mailing list
info-cvs@nongnu.org
http://lists.nongnu.org/mailman/listinfo/info-cvs