Filename character translation

2002-12-07 Thread Savin Gorup
Hi,

I came across a problem with rsync-2.5.5 on Cygwin/Win2K while rsyncing
files whose names contain 'strange' (non-Latin-1) characters. The problem is
that filenames on the Windows system are encoded (in our case) in codepage
852, while the server (a Linux system) encodes filenames according to
ISO-8859-2. These two are not fully compatible, which causes rsync to simply
skip copying some files (and whole directories!) to the server.
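
The mismatch described above can be illustrated with a short sketch. It is not part of rsync; it assumes that Python's `cp852` and `iso8859_2` codecs correspond to the two encodings mentioned, and the sample letters and the function name `as_seen_remotely` are chosen only for illustration.

```python
# Illustration only: the same letters have different byte values in
# codepage 852 and in ISO-8859-2, so raw filename bytes do not carry over.

# Arbitrary sample letters that exist in both character sets:
# c with caron, s with caron, z with caron.
SAMPLE = "\u010d\u0161\u017e"


def byte_values(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as a hex string."""
    return text.encode(encoding).hex()


def as_seen_remotely(name: str) -> str:
    """Encode a name as CP852, then read the same bytes as ISO-8859-2."""
    return name.encode("cp852").decode("iso8859_2")
```

Comparing `byte_values(SAMPLE, "cp852")` with `byte_values(SAMPLE, "iso8859_2")` shows different bytes for the same letters, and `as_seen_remotely(SAMPLE)` no longer equals `SAMPLE`, while a plain ASCII name is unaffected.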

Samba solves this kind of problem with its 'client code page' and 'character
set' options. I propose a somewhat simpler solution using a translation table
between the local and remote file systems.

I have developed a patch to address the problem, which basically does this:
- adds a command line option --filename-translation (options.c)
- builds a two-way character translation lookup table in memory (512 bytes)
  (utils.c)
- translates filenames at the appropriate places (sender.c, flist.c)
  if --filename-translation is present
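
The idea of a two-way, single-byte lookup table can be sketched as follows. This is a minimal illustration and not the patch itself: it assumes CP852 on the local side and ISO-8859-2 on the remote side, and `build_table`, `translate`, `TO_REMOTE` and `TO_LOCAL` are hypothetical names that do not come from the rsync source.

```python
# Sketch of a two-way, single-byte filename translation table.
# Assumption: local side is CP852, remote side is ISO-8859-2.
# All names here are hypothetical; none of them exist in rsync.


def build_table(src: str, dst: str) -> bytes:
    """Return a 256-byte table mapping each byte from src to dst encoding."""
    table = bytearray(range(256))      # start from the identity mapping
    for b in range(0x80, 0x100):       # bytes below 0x80 are plain ASCII
        try:
            out = bytes([b]).decode(src).encode(dst)
        except UnicodeError:
            continue                   # no counterpart: leave the byte as is
        if len(out) == 1:
            table[b] = out[0]
    return bytes(table)


TO_REMOTE = build_table("cp852", "iso8859_2")   # 256 bytes
TO_LOCAL = build_table("iso8859_2", "cp852")    # 256 bytes, 512 in total


def translate(name: bytes, table: bytes) -> bytes:
    """Translate a filename byte by byte through a lookup table."""
    return name.translate(table)
```

In this sketch, bytes with no counterpart in the other character set (for example the CP852 box-drawing characters) pass through unchanged, so the round trip is only guaranteed for characters present in both sets. How the actual patch treats such bytes is not stated here.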

Note that this patch can't handle multibyte encodings. The performance impact
of the translation should be negligible, especially when the option is not
active. The patch changes multiple files and is rather long, so I'd like to
open a discussion before posting it.

There has been some interest in this topic here before
(http://www.mail-archive.com/rsync@lists.samba.org/msg03306.html) and also
on some other, local mailing lists. Since the inability to copy all files
renders rsync unusable for non-Latin-1 users, I would like to hear some
comments about including the patch in the main source tree (or proposals for
a better solution, of course).

Bye,
Savin Gorup

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Filename character translation

2002-12-07 Thread Rainer Zocholl
Savin Gorup wrote on 07.12.02 11:54:


I came across a problem with rsync-2.5.5 on Cygwin/Win2K while
rsyncing files whose names contain 'strange' (non-Latin-1) characters.
The problem is that filenames on the Windows system are encoded (in our
case) in codepage 852, while the server (a Linux system) encodes
filenames according to ISO-8859-2. These two are not fully compatible,
which causes rsync to simply skip copying some files (and whole
directories!) to the server.

Samba solves this kind of problem by using 'client code page' and
'character set' options. 

But rsync does not work on such funny characters in Samba directories
either! At least 2.5.5 on SCO OSR5 failed. I thought it was a SCO problem
(rdist did not work either), so I went back to cpio to make the
remote backup work. Maybe I can find the error messages somewhere.
IIRC rsync simply stops working on the first file with an umlaut
(Ü (0x9A)) and continues with the next directory...
Unless someone compares the results exactly, every time,
he will not become aware of the problem: a (Samba) user might create
such a filename in the meantime, and from that day on only half of the
directory is backed up...
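
A check of this kind can be automated. The following is a minimal sketch, assuming both trees are reachable as local paths and use the same filename encoding (as in a Unix-to-Unix copy); `relative_files` and `missing_in_backup` are hypothetical helper names.

```python
import os


def relative_files(root: str) -> set:
    """Collect the paths of all files under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            full = os.path.join(dirpath, fname)
            found.add(os.path.relpath(full, root))
    return found


def missing_in_backup(source: str, backup: str) -> list:
    """Return files present under source but absent under backup."""
    return sorted(relative_files(source) - relative_files(backup))
```

A non-empty result after a backup run would point at files that were silently skipped. Note that `os.walk` ignores directory listing errors by default, so this sketch only compares what it is able to list.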



I propose a somewhat simpler solution using a
translation table between the local and remote file systems.

I have developed a patch to address the problem,
which basically does this:
- adds a command line option --filename-translation (options.c)
- builds a two-way character translation lookup table in memory (512 bytes)
  (utils.c)
- translates filenames at the appropriate places (sender.c, flist.c)
  if --filename-translation is present

Note that this patch can't handle multibyte encodings.

That's a problem:

The normal NT _findfirst translates all(!) Unicode characters > 0xff
to 0x3f ('?'), AFAIK.
On the Unix box the ? (= wildcard!) in the file name causes no problem.
But a restore will be impossible, because ? is not a legal
character on NTFS/FAT...
Your mapping fails too, because all Unicode characters are already
mapped by the time rsync sees them (unless _tfindfirst is used!).
But:
I don't know what the Cygwin API is doing.
Maybe it does better than NT?
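
Why a byte-level table cannot help once names have been flattened to '?' can be illustrated with a short sketch. It assumes that Python's `errors="replace"` encoding behaves like the substitution described above; it does not call the Windows API, and the names `to_codepage`, `NAME_A` and `NAME_B` are made up for the example.

```python
# Illustration of lossy conversion to an 8-bit code page.
# Assumption: errors="replace" stands in for the substitution that the
# 8-bit directory API is described as performing.


def to_codepage(name: str, codepage: str = "cp852") -> bytes:
    """Encode a name, replacing unrepresentable characters with '?' (0x3f)."""
    return name.encode(codepage, errors="replace")


# Two distinct, hypothetical names that differ only in one character
# that does not exist in the code page.
NAME_A = "report \u4e00.txt"
NAME_B = "report \u4e8c.txt"
```

Both names come out as the same byte string containing 0x3f, so the original name can be neither opened nor reconstructed from it, whatever translation table is applied afterwards.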



There has been some interest in this topic here before
(http://www.mail-archive.com/rsync@lists.samba.org/msg03306.html) and
also on some other, local mailing lists. Since the inability to copy all
files renders rsync unusable for non-Latin-1 users

Yep. I was very disappointed about that, but have had no time to work
on the problem...

I would like to hear some comments about including the patch
in the main source tree (or proposals for a better solution, of course).

I would be happy if rsync were able to copy Samba shares
between Unixes...

Hasn't that problem already been solved for CD filesystems?
(Rock Ridge extensions?)



Thanks for bringing the problem to the list!





Re: Filename character translation

2002-12-07 Thread Rainer Zocholl
Savin Gorup, 07.12.02 11:54:

Once upon a time Savin Gorup shaped the electrons to say...

There has been some interest in that topic before here
(http://www.mail-archive.com/rsync@lists.samba.org/msg03306.html) and

From: Martin Bene 
Subject: rsync 2.5.1 on NT/cygwin: can't handle filenames with non-latin1 character set
Date: Sat, 09 Mar 2002 05:25:18 -0800 

when using rsync on NT: rsync can't handle filenames with strange 
  
$ rsync -av /cygdrive/c/data/transfer/Marisa/ Marisa/
building file list ... readlink Imagelep. 10?1: No such file or directory
readlink Imagelep. 11?2: No such file or directory
readlink Imagelep. 9?1: No such file or directory
done

That sounds very much like the API problem of _findfirst
converting Unicode to 0x3F.

Yes, NT's _findfirst returns filenames that cannot be used
to open the file!


One approach, at least to be able to open such a file from
the 8-bit world, might be to use the 8.3 name mangling.
But on large directories that is very bad for performance.


