Yes, we have numerous locales with different codesets. In Solaris 10,
as an example, we have 165 locales using 23 different codesets.
In many cases, codesets use quite similar representation forms, and yet
the mappings between the code point values and actual characters/glyphs
are quite different.

Underlying file systems also store characters in various ways, although
many newer file systems are converging on Unicode. (Even then, the
rather new file systems that use Unicode sometimes use different
Unicode encodings that are not byte-for-byte compatible with one
another.)

To solve the problem of non-ASCII characters not being shown correctly,
while keeping maximum compatibility with existing applications and with
the numerous locales and codesets, it appears we can either tag the
codeset of each file, or adopt Unicode, in particular UTF-8, as the file
system codeset and add transparent codeset conversion on top of it.
These two approaches could be supported together or separately.
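
As a rough illustration of the second approach, a user-land conversion
layer could look like the sketch below. It uses iconv(3) and assumes
the caller already knows the source codeset (e.g. from
nl_langinfo(CODESET)); the function name and error handling are made up
for the example.

/*
 * Sketch: convert a file name from a known locale codeset to UTF-8
 * before it is handed to a UTF-8-based file system layer.  (The type
 * of iconv()'s second argument differs slightly between platforms,
 * hence the cast on the input pointer.)
 */
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

char *
name_to_utf8(const char *name, const char *from_codeset)
{
    iconv_t cd = iconv_open("UTF-8", from_codeset);
    if (cd == (iconv_t)-1)
        return (NULL);                  /* conversion not supported */

    size_t inleft = strlen(name);
    size_t outsize = inleft * 4 + 1;    /* generous worst case */
    size_t outleft = outsize - 1;
    char *out = malloc(outsize);
    char *inp = (char *)name;
    char *outp = out;

    if (out == NULL ||
        iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        free(out);
        (void) iconv_close(cd);
        return (NULL);                  /* illegal byte sequence, etc. */
    }
    *outp = '\0';
    (void) iconv_close(cd);
    return (out);
}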

Regarding the file name length, having a big enough limit will obviously
help, as long as there is a clear way to keep backward compatibility
with minimal breakage. Sticking to the current user-land length
definitions is another option, i.e., no change to the length limits of
the existing (traditional) file systems.

Ienup

Joerg Schilling wrote at 11/30/06 12:59:
Ienup Sung <[EMAIL PROTECTED]> wrote:


I think distinguishing between UTF-8 and ISO8859-? codesets by
examining the byte values or patterns used in file names is quite
difficult and not always possible. I'd be interested to hear from you
what the best way of achieving that would be.
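
(For illustration only, a minimal well-formedness check of that kind
could look like the sketch below. It can reject names that are
definitely not UTF-8, but a name that happens to be well-formed UTF-8
could still have been meant as ISO8859-x, so it remains a heuristic.)

/*
 * Heuristic sketch: returns 1 if the byte string could be UTF-8,
 * 0 if it definitely is not.  Overlong forms and similar corner
 * cases are ignored here.
 */
static int
looks_like_utf8(const unsigned char *s)
{
    while (*s != '\0') {
        int follow;

        if (*s < 0x80)                  /* plain ASCII */
            follow = 0;
        else if ((*s & 0xE0) == 0xC0)   /* 2-byte sequence */
            follow = 1;
        else if ((*s & 0xF0) == 0xE0)   /* 3-byte sequence */
            follow = 2;
        else if ((*s & 0xF8) == 0xF0)   /* 4-byte sequence */
            follow = 3;
        else
            return (0);                 /* invalid lead byte */
        s++;
        while (follow-- > 0) {
            if ((*s & 0xC0) != 0x80)    /* continuation byte missing */
                return (0);
            s++;
        }
    }
    return (1);
}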


I did not think about other encodings, only about ISO-8859-1, as it is
the most popular single-byte encoding. I have not yet thought the idea
through completely, but there is another idea to (mostly) deal with the
problem.
See below....



PS. BTW, I think we have about three (or perhaps more) kinds of file
name length restrictions or constraints, and various problems stem from
the possible combinations of the three:

- Different locales/codesets use different number of bytes to
  represent the same characters.
- Multiple user land side max filename length definitions.
- Multiple per file system max filename length definitions.


In order to avoid unneeded problems, I recommend changing MAXNAMELEN in
usr/src/uts/common/fs/lookup.c (and maybe a few other files) to 1024.
This would already allow hsfs to be used with Joliet without limitations.
If you would like to test this, use the undocumented mount option
"jolietlong" and a Joliet CD with very long file names. If you do not
change lookup.c, you will be able to see the long file names (up to
330 bytes - 110 UCS-2 chars) but not to stat/open them. If you change
MAXNAMELEN, you may also use them.
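
(A quick user-land way to see that failure mode, purely as an
illustration, is to stat() one of the long names that ls(1) shows on
such a CD:)

/*
 * Sketch: stat() the path given on the command line.  On a kernel
 * where MAXNAMELEN has not been raised, the long Joliet names show up
 * in directory listings but stat()/open() on them fail.
 */
#include <sys/stat.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv)
{
    struct stat sb;

    if (argc != 2) {
        (void) fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return (2);
    }
    if (stat(argv[1], &sb) == -1) {
        (void) fprintf(stderr, "stat failed: %s (path is %lu bytes)\n",
            strerror(errno), (unsigned long)strlen(argv[1]));
        return (1);
    }
    (void) printf("stat ok, %lld bytes\n", (long long)sb.st_size);
    return (0);
}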


And I think we already have this problem of a "mismatch" between the
number of bytes and the number of characters allowed in file names, at
various levels both vertically and horizontally.

While people usually don't see the problem very often (since not that
many people create and use really lengthy file names daily), I think it
does exist today, and not just on traditional file systems such as UFS
but also on rather new Unicode file systems such as NTFS, HFS+, UDF,
and so on. For instance, NTFS allows 255 16-bit units for a filename,
which can translate into 255 or 127 UTF-16 characters. It is similar
for UDF; a name can hold 127 or 254 characters depending on which
compression id is used.
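
(The 255-versus-127 difference comes from surrogate pairs: a character
outside the Basic Multilingual Plane needs two 16-bit units. A tiny
sketch of the counting, with a made-up function name:)

/*
 * Sketch: count the 16-bit units a sequence of Unicode code points
 * needs in UTF-16.  A 255-unit limit therefore holds 255 BMP
 * characters, but only 127 characters from the supplementary planes.
 */
#include <stddef.h>
#include <stdint.h>

static size_t
utf16_units_needed(const uint32_t *codepoints, size_t n)
{
    size_t units = 0;

    while (n-- > 0)
        units += (*codepoints++ > 0xFFFF) ? 2 : 1;
    return (units);
}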


If NTFS allows 255 UCS-2 chars, you need to set MAXNAMELEN to at least
765.

I did not check the ZFS on-disk structures, but on UFS MAXNAMELEN could
be raised to 503 to allow longer Unicode names. If MAXNAMELEN is 503,
then we could allow 251 ISO-8859-1 chars from the 8-bit range or 167
katakana characters.
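
(These numbers assume the worst-case UTF-8 expansion: a non-ASCII
ISO-8859-1 character takes 2 bytes and other BMP characters such as
katakana take 3 bytes, so 255 * 3 = 765, while 251 * 2 = 502 and
167 * 3 = 501 both fit within 503. A trivial helper expressing the
per-unit worst case:)

/*
 * Sketch: worst-case number of UTF-8 bytes for a name of n 16-bit
 * units.  Each unit expands to at most 3 bytes (a surrogate pair
 * turns 2 units into 4 bytes, i.e. only 2 bytes per unit), so a
 * 255-unit NTFS name needs up to 255 * 3 = 765 bytes.
 */
#include <stddef.h>

static size_t
utf8_bytes_worst_case(size_t utf16_units)
{
    return (utf16_units * 3);
}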


Jörg
