Filename character translation
Hi,

I came across a problem with rsync-2.5.5 on Cygwin/Win2K when rsyncing files whose names contain 'strange' (non-Latin-1) characters. Filenames on the Windows system are encoded (in our case) in code page 852, while the server (a Linux system) uses ISO-8859-2 for filenames. These two encodings are not fully compatible, which causes rsync to simply skip copying some files (and whole directories!) to the server.

Samba solves this kind of problem with its 'client code page' and 'character set' options. I propose a somewhat simpler solution: a translation table between the local and the remote file system. I have developed a patch to address the problem, which basically does this:

- adds a command line option, --filename-translation (options.c)
- builds a two-way character translation lookup table in memory (512 bytes) (utils.c)
- translates filenames at the appropriate places (sender.c, flist.c) if --filename-translation is present

Note that this patch can't handle multibyte encodings. The performance impact of translation should be negligible, especially when it is not active.

The patch changes multiple files and is rather long, so I'd like to open a discussion before posting it. There has been some interest in this topic here before (http://www.mail-archive.com/rsync@lists.samba.org/msg03306.html) and also on some other, local mailing lists. Since the inability to copy all files renders rsync unusable for non-Latin-1 users, I would like to hear some comments about including the patch in the main source tree (or proposals for a better solution, of course).

Bye,
Savin Gorup

--
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
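The two-way lookup table described in the patch summary above could look roughly like the following. This is only a sketch and not the code from the patch: the names xlat_table, xlat_init, xlat_add_pair and xlat_name are made up for illustration. Two 256-byte arrays give the 512 bytes mentioned, and starting from the identity mapping lets ASCII bytes pass through unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is NOT the code from the patch. */
struct xlat_table {
    unsigned char to_remote[256];   /* local byte  -> remote byte */
    unsigned char to_local[256];    /* remote byte -> local byte  */
};

/* Start from the identity mapping so unlisted bytes pass through as-is. */
static void xlat_init(struct xlat_table *t)
{
    assert(t != NULL);
    for (int i = 0; i < 256; i++) {
        t->to_remote[i] = (unsigned char)i;
        t->to_local[i] = (unsigned char)i;
    }
}

/* Register one character pair in both directions. */
static void xlat_add_pair(struct xlat_table *t,
                          unsigned char local, unsigned char remote)
{
    assert(t != NULL);
    t->to_remote[local] = remote;
    t->to_local[remote] = local;
}

/* Translate a NUL-terminated single-byte file name in place. */
static void xlat_name(const unsigned char map[256], char *name)
{
    assert(map != NULL && name != NULL);
    for (unsigned char *p = (unsigned char *)name; *p != '\0'; p++)
        *p = map[*p];
}
```

As an example, xlat_add_pair(&t, 0x9F, 0xE8) would register the pair for 'č', assuming I recall the code page 852 and ISO-8859-2 tables correctly (please verify against the real tables). A complete table has to list all 128 high bytes so that the mapping stays a permutation; otherwise two local bytes can end up on the same remote byte and the reverse translation is no longer exact.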
Re: Filename character translation
[EMAIL PROTECTED](Savin Gorup) 07.12.02 11:54 wrote:

> I came across the problem with rsync-2.5.5 on Cygwin/Win2K while rsyncing with filenames which have 'strange' (non latin-1) characters in filenames. The problem is that filenames on Windows system are coded (in our case) in codepage 852, while server (Linux system) has filename coding according to ISO-8859-2. These two are not fully compatible, causing rsync to simply skip copying some files (and whole directories!) to server. Samba solves this kind of problem by using 'client code page' and 'character set' options.

But rsync does not work on such funny chars in Samba dirs either! At least 2.5.5 on SCO OSR5 failed. I thought it was a SCO problem (rdist did not work either), so I went back to cpio to make the remote backup work. Maybe I can find the error messages somewhere. IIRC rsync simply stops working on the first file with an umlaut (Ü (0x9A)) and continues with the next directory...

Unless someone compares the results exactly, every time, he will not become aware of the problem: a (Samba) user might create such a filename in the meantime, and from that day on only half of the directory is backed up...

> I propose somewhat simpler solution using translation table between local and remote file system. I have developed a patch to address the problem, which basically does this:
> - adds command line option --filename-translation (options.c)
> - builds two way character translation lookup table in memory (512 bytes) (utils.c)
> - translates filenames at appropriate places (sender.c, flist.c) if --filename-translation is present
> Note this patch can't handle multibyte encodings.

That's a problem: AFAIK the normal NT _findfirst translates all(!) Unicode characters above 0xff to 0x3f ('?'). On the Unix box the '?' (= wildcard!) in the file name causes no problem. But a restore will be impossible, because '?' is not a legal character on NTFS/FAT... Your mapping fails too, because all Unicode chars are already mapped by the time rsync sees them (unless _tfindfirst is used!).

But: I don't know what the Cygwin API is doing. Maybe it does better than NT?

> There has been some interest in that topic before here (http://www.mail-archive.com/rsync@lists.samba.org/msg03306.html) and also on some other, local mailing lists. Since inability to copy all files renders rsync unusable to non-latin-1 users

Yepp. I was very disappointed about that, but have had no time to work on the problem.

> I would like to hear some comments about including the patch into main source tree (or proposing a better solution, of course).

I would be happy if rsync were able to copy Samba shares between Unixes... Hasn't that problem already been solved for CD filesystems? (Rock Ridge extensions?)

Thanks for bringing the problem to the list!
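To illustrate why a byte-level translation table cannot undo the '?' substitution described above, here is a small simulation. It does not call any Windows API; narrow_lossy and names_collide are hypothetical helpers, and the rule "everything above 0xFF becomes 0x3F" is simply taken from the AFAIK statement in this message, not from documented _findfirst behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#define NAME_BUF 260   /* arbitrary buffer size chosen for this sketch */

/* Simulated narrowing: any code point above 0xFF is replaced by '?' (0x3F).
 * Hypothetical helper; it only mimics the behaviour described in the thread. */
static void narrow_lossy(const wchar_t *wide, char *out, size_t out_size)
{
    size_t i = 0;
    assert(wide != NULL && out != NULL && out_size > 0);
    for (; wide[i] != L'\0' && i + 1 < out_size; i++) {
        unsigned long c = (unsigned long)wide[i];
        out[i] = (c <= 0xFF) ? (char)(unsigned char)c : '?';
    }
    out[i] = '\0';
}

/* Returns 1 if two different wide names become the same narrow name. */
static int names_collide(const wchar_t *a, const wchar_t *b)
{
    char na[NAME_BUF], nb[NAME_BUF];
    narrow_lossy(a, na, sizeof na);
    narrow_lossy(b, nb, sizeof nb);
    return wcscmp(a, b) != 0 && strcmp(na, nb) == 0;
}
```

Two names that differ only in one character above 0xFF both come out with a '?' in that position, so the original character is gone before any translation table could be applied.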
Re: Filename character translation
[EMAIL PROTECTED](Savin Gorup) 07.12.02 11:54
Once upon a time Savin Gorup shaped the electrons to say...

> There has been some interest in that topic before here (http://www.mail-archive.com/rsync@lists.samba.org/msg03306.html) and

From: Martin Bene
Subject: rsync 2.5.1 on NT/cygwin: can't handle filenames with non-latin1 character set
Date: Sat, 09 Mar 2002 05:25:18 -0800

when using rsync on NT: rsync can't handle filenames with strange

$ rsync -av /cygdrive/c/data/transfer/Marisa/ Marisa/
building file list ...
readlink Imagelep. 10?1: No such file or directory
readlink Imagelep. 11?2: No such file or directory
readlink Imagelep. 9?1: No such file or directory
done

That sounds very much like the API problem of _findfirst converting Unicode to 0x3F. Yes, NT's _findfirst delivers filenames which are unusable for a file open!

One approach, at least to be able to open such a file from the 8-bit world, might be to use the 8.3 name mangling. But on long directories that is very bad for performance.
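Since '?' is not a legal character in NTFS/FAT names, a '?' in a name returned by the directory scan on the Windows side is a strong hint that the name was converted lossily, as in the readlink failures quoted above. A sender could at least report such names instead of silently skipping them. The following is a minimal sketch of that check; name_looks_lossy and count_lossy_names are hypothetical helpers, not rsync functions, and the check makes no sense on a Unix sender, where '?' is a perfectly legal file name character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the name contains '?', which cannot occur in a real
 * NTFS/FAT file name. Hypothetical helper, not part of rsync. */
static int name_looks_lossy(const char *name)
{
    assert(name != NULL);
    return strchr(name, '?') != NULL;
}

/* Counts the suspicious names in a list, e.g. to print one warning
 * per run instead of silently dropping those files. */
static size_t count_lossy_names(const char *const *names, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (name_looks_lossy(names[i]))
            hits++;
    return hits;
}
```

This does not make the files transferable; it only makes the failure visible, which addresses the concern raised earlier in the thread that a partial backup can go unnoticed.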