> On Mon, 2010-11-01 at 10:36 +0100, Joerg Schilling wrote:
> > "Garrett D'Amore" <garr...@damore.org> wrote:
> >
> > > On Mon, 2010-10-25 at 00:43 -0700, shilpa wrote:
> > > > How would glob feature work, in case multibyte filenames are
> > > > allowed? Because a multibyte character is a combination of
> > > > more than one character, which includes glob characters
> > > > like "{", "["....
> > >
> > > I'm not sure how globbing would work, frankly.  However, I
> > > believe UTF8 and other common multibyte character schemes always
> > > have bytes with the high-order bit set, so that there is never a
> > > multibyte character that has component bytes that collide with
> > > ASCII.  So this problem should be a non-issue.
> >
> > The assumption that multi-byte characters use octets with the
> > high order bit set is only correct for so called stateless locales.
> >
> > Locales that use shift codes behave differently.
>
> Actually, it's a safe assumption for UTF-8, which is the main
> concern I think.
>
> The bigger question here is not locales, but character encoding
> schemes, I think.  Specifically we're talking about filenames,
> which do not inherently carry a locale with them, but might be
> encoded in one of a small number of locales... for UTF-8 I believe
> the code is fine.
>
> Also, if the code is using libc's glob interfaces, it's fine too,
> because libc's glob code is sensitive to the locale and correctly
> handles stateful encodings.
> 
>       - Garrett
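
To make the byte-level point in the quoted exchange concrete, here is
a rough Python sketch (not anything from the ftp sources).  It checks
whether the encoded bytes of some non-ASCII characters ever equal an
ASCII glob metacharacter.  The sample set (kana), the helper name
colliding_glob_bytes, and the choice of ISO-2022-JP as an example of
a stateful, shift-code encoding are my own assumptions.

```python
# Glob metacharacters, as ASCII byte values.
GLOB_META = set(b"*?[]{}\\")

def colliding_glob_bytes(chars, encoding):
    """Collect glob metacharacters whose byte value shows up in the
    encoded form of any character in `chars` (hypothetical helper)."""
    hits = set()
    for ch in chars:
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # character not representable in this encoding
        hits.update(chr(b) for b in encoded if b in GLOB_META)
    return hits

# Sample characters (an arbitrary choice): hiragana and katakana.
KANA = [chr(cp) for cp in range(0x3041, 0x30FB)]

utf8_hits = colliding_glob_bytes(KANA, "utf-8")
stateful_hits = colliding_glob_bytes(KANA, "iso2022_jp")

print("utf-8 collisions:     ", sorted(utf8_hits))
print("iso2022_jp collisions:", sorted(stateful_hits))
```

In UTF-8 every byte of a multibyte sequence has the high-order bit
set, so the first set comes out empty.  ISO-2022-JP switches character
sets with escape sequences and then uses 7-bit bytes, so some of those
bytes coincide with "[" and friends, which is the case a byte-oriented
glob would get wrong.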

Between any two systems such that the filesystems in question
are content to store arbitrary bytes in the name (other than
/ and '\0' of course), and where the commands including the
filenames are passed 8-bit clean, I'd expect the name would
be preserved, assuming that a UTF-8 encoding is used to read
the filename on both ends.
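
As a small illustration of that expectation (a sketch only; the
function names is_storable_name and transfer are made up, and
transfer merely stands in for an 8-bit clean channel):

```python
def is_storable_name(raw: bytes) -> bool:
    """A filesystem that stores arbitrary bytes still rejects '/' and NUL
    in a single name component (hypothetical check)."""
    return len(raw) > 0 and b"/" not in raw and b"\0" not in raw

def transfer(raw: bytes) -> bytes:
    """Stand-in for an 8-bit clean channel: bytes arrive unchanged."""
    return bytes(raw)

name = "r\u00e9sum\u00e9-\u65e5\u672c\u8a9e.txt"   # non-ASCII sample name
sent = name.encode("utf-8")                        # sender reads the name as UTF-8
received = transfer(sent)

storable = is_storable_name(received)
same_when_utf8 = received.decode("utf-8") == name       # both ends use UTF-8
same_when_latin1 = received.decode("latin-1") == name   # receiver assumes Latin-1

print(storable, same_when_utf8, same_when_latin1)
```

The bytes survive either way; what changes is how the receiving end
interprets them.  If that end assumes a different encoding, the name
is stored intact but displayed as garbage.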

Globbing depends on the capabilities of the system
that's doing the expansion (whichever is sending, I imagine, i.e.
remote for mget, local for mput).  Ideally all should convert to
UTF-8 to send the filenames, and if needed from UTF-8 to store
them.  But that doesn't happen, AFAIK.  On most Unix systems,
if your filenames are in UTF-8, it should just work.  But some
filesystems on those OSes may not support UTF-8.  FAT filesystems
probably won't; NTFS (if supported) is AFAIK in UTF-16; not
sure what the limits are on hsfs without Rock Ridge extensions, etc.
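
To show why the expanding side's capabilities matter, here is a
minimal sketch using Python's fnmatch module as a stand-in for a glob
matcher (it is not libc's glob(3), and the sample name is made up).
The same UTF-8 name matches "?" differently depending on whether the
matcher works on characters or on raw bytes:

```python
import fnmatch

name = "\u00e9.txt"           # "é.txt": é is one character ...
raw = name.encode("utf-8")    # ... but two bytes (0xC3 0xA9) in UTF-8

# Character-aware matching: "?" consumes the whole character.
char_match = fnmatch.fnmatchcase(name, "?.txt")

# Byte-oriented matching: "?" consumes a single byte, so one "?"
# is not enough, while two happen to fit.
byte_match_one = fnmatch.fnmatchcase(raw, b"?.txt")
byte_match_two = fnmatch.fnmatchcase(raw, b"??.txt")

print(char_match, byte_match_one, byte_match_two)
```

So even when the bytes of a UTF-8 name never collide with glob
characters, a pattern like "?.txt" can give different results on a
client or server that globs byte by byte.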

Given the present sorry state of i18n support in almost _all_
ftp clients and servers, it's pretty good that it sort of works
between Unix systems when the files on both ends will be going
to/from filesystems that can handle UTF-8 names.

As I mentioned elsewhere in this or a related thread, there
is allegedly at least one open-source ftp server that purports
to support modern protocol extensions for doing this correctly:
http://www.pro-bono-publico.de/projects/ftpd.html

I still haven't found a command-line ftp client that's serious
about i18n.

At least that's how I'd understand it all...
-- 
This message posted from opensolaris.org
_______________________________________________
opensolaris-code mailing list
opensolaris-code@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/opensolaris-code
