On Tuesday, 6 September 2005, 01:17, Yves Dorfsman wrote:
> Hi,
> Has anybody run into problems with GNU CVS and Unicode?
> I have made a few tests (with UTF-8), and so far it has worked, but
> some of my users say they ran into problems with some files. I can see
> how some legal UTF-8 characters could be mistaken for control
> codes/binary.
> Does anybody have extensive experience with this?
Yes.
> Encoding problems are on the operating system / editor side.
> CVS does not care about the encoding at all.
Except that the diffs between files are computed byte-wise, not
character-wise. This could lead to problems when multi-byte characters
occur, though such problems are rare at best. My guess is that merging
will succeed most of the time, but that the actual diff might contain
invalid byte sequences (although, considering CVS's line-by-line diff,
that may never occur).
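In fact, UTF-8's design supports that hunch: every byte of a multi-byte
sequence is >= 0x80, so a byte-oriented, line-based diff can never
mistake part of a character for a line terminator. A small Python
sketch (just an illustration of the byte layout, nothing CVS-specific):

    # All bytes of a multi-byte UTF-8 sequence are >= 0x80, so an LF
    # byte (0x0A) in UTF-8 data is always a real line terminator.
    text = "naïve café\n日本語\n"
    data = text.encode("utf-8")

    # 0x0A occurs in the bytes exactly as often as "\n" in the text:
    assert data.count(0x0A) == text.count("\n")

    # Splitting byte-wise on LF therefore never cuts a character apart:
    for line in data.split(b"\n"):
        line.decode("utf-8")  # would raise UnicodeDecodeError otherwise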
> Somebody wrote:
> > In CVS a Unicode file has to be a binary file (-kb), which prevents
> > merging, diffs, etc. If you do not define it as -kb, then eventually
> > the file will be corrupted.
> This is completely wrong and lacks any technical substance.
Both pretty firm statements, and both partially right, partially wrong.
Nobody has yet found a multi-byte sequence that actually breaks merging
or diffing, but since CVS was not designed for multi-byte characters,
there may very well be one (+).
> The issue is not CVS; the issue is telling your editor about the
> correct file encoding. It's the text editor and how it interprets
> byte sequences.
I agree partially. Sure, the editor must support multi-byte characters.
But I think it is a mistake for CVS to start supporting Unicode without
supporting the environment's native encoding. "We only support Unicode
if you manually save all your text files as UTF-8" does not constitute
full Unicode support, I think.
> The Unicode thingy in CVSNT is just a hack to work around operating
> system issues regarding MS Windows.
I have no idea what the Unicode support in CVSNT does. I always thought
it was there to prevent invalid CR/LF conversions inside multi-byte
characters. For example, code point 522 (U+020A) is encoded in UTF-16BE
as the bytes 0x02 0x0A, and that second byte looks like a line feed to
a byte-oriented tool. But I never used CVSNT and don't know whether it
even supports UTF-16 (both BE and LE) or only UTF-8. The conversion
problem cannot even occur in UTF-8, because all bytes of a multi-byte
sequence are >= 0x80.
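The hazard is easy to show in a few lines of Python (purely
illustrative; I am not claiming this is what CVSNT actually guards
against):

    # Code point 522 (U+020A) contains an LF byte in UTF-16BE, so a
    # naive byte-wise LF -> CRLF conversion destroys the character.
    cp = "\u020a"
    utf16 = cp.encode("utf-16-be")   # b'\x02\x0a'
    utf8  = cp.encode("utf-8")       # b'\xc8\x8a', every byte >= 0x80

    assert utf16.replace(b"\n", b"\r\n") == b"\x02\r\n"  # corrupted
    assert utf8.replace(b"\n", b"\r\n") == utf8          # unchanged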
> I have been using UTF-8 in tons of files for years now (all my Java
> sources are UTF-8 encoded, as well as most of my C++ sources and, of
> course, all my XML files) without any problems.
I do the same and have not encountered problems so far. But that does
not mean there are no issues. What you and I are doing is a workaround
to use Unicode with CVS.
> UTF-16, in fact, can be problematic. Normal keyword substitution is
> likely to fail, at least with some older versions of CVS. I don't know
> whether newer CVS uses wchar instead of char for keyword substitution.
> UTF-16 isn't in widespread use, so I haven't cared about that yet.
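That failure mode is easy to demonstrate: a byte-oriented scanner looks
for the ASCII bytes of "$Id$", and in UTF-16 those bytes are interleaved
with NULs. A quick Python illustration (not actual CVS code):

    # A byte-wise search for the keyword cannot match UTF-16 text,
    # because a NUL byte sits between every pair of ASCII bytes.
    keyword = b"$Id$"
    line = "$Id$ example\n"

    assert keyword in line.encode("utf-8")          # found
    assert keyword not in line.encode("utf-16-le")  # b'$\x00I\x00d\x00$\x00'...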
Maybe it would be worth investigating having the client handle all
character-set conversions and always storing repository files as UTF-8.
As far as you and I can attest, there are (so far) no issues with
handling UTF-8 files, so UTF-16 might then no longer be a problem.
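For what it's worth, such a client-side filter might look roughly like
this (a hypothetical Python sketch; the function names are made up and
nothing like this exists in CVS):

    import locale

    def to_repository(data, local_encoding=None):
        # Hypothetical commit-side filter: local encoding -> UTF-8.
        enc = local_encoding or locale.getpreferredencoding(False)
        return data.decode(enc).encode("utf-8")

    def from_repository(data, local_encoding=None):
        # Hypothetical checkout-side filter: UTF-8 -> local encoding.
        enc = local_encoding or locale.getpreferredencoding(False)
        return data.decode("utf-8").encode(enc)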
But then again, if the character-set support is already in the client,
how much effort would it take to move it to the server too? But I'm not
developing CVS, so I will kindly stop talking now ;)
Arno
_______________________________________________
Info-cvs mailing list
info-cvs@nongnu.org
http://lists.nongnu.org/mailman/listinfo/info-cvs