I think distinguishing between UTF-8 and ISO 8859-x codesets by
examining byte values or patterns in file names is quite difficult
and not always possible. I'd be interested to hear your thoughts on
the best way of achieving that.
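To make the difficulty concrete, here is a minimal sketch (Python, purely for illustration) of the usual heuristic: treat a name as UTF-8 if it decodes cleanly. It shows why this can only ever be a guess, since some ISO 8859-1 byte sequences are also valid UTF-8:

```python
def looks_like_utf8(name: bytes) -> bool:
    """Guess whether a byte string is UTF-8 by trying to decode it.

    Only a heuristic: pure-ASCII names decode under both codesets,
    and some ISO 8859-1 byte sequences also happen to form valid
    UTF-8, so the answer can be a guess but never a proof.
    """
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# An 8-bit Latin-1 name is usually rejected as UTF-8 ...
assert not looks_like_utf8("résumé".encode("latin-1"))
assert looks_like_utf8("résumé".encode("utf-8"))

# ... but not always: b'\xc3\xa9' is "é" in UTF-8 *and* "Ã©" in Latin-1,
# so this name is genuinely ambiguous.
assert looks_like_utf8(b"\xc3\xa9")
```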
Ienup
PS. BTW, I think we have about three (or perhaps more) kinds of file name
length restrictions or constraints, and various problems stem from
any combination of the three:
- Different locales/codesets use different numbers of bytes to
represent the same characters.
- Multiple user land side max filename length definitions.
- Multiple per file system max filename length definitions.
And I think we already have this "mismatch" problem between the number
of bytes and the number of characters allowed in file names, at various
levels both vertically and horizontally.
While people usually don't notice the problem often (since not that many
people create and use really long file names daily), it does exist
today, and not just on traditional file systems such as UFS but also
on newer Unicode file systems such as NTFS, HFS+, UDF, and so on.
For instance, NTFS allows 255 16-bit units for a filename, which can
translate into either 255 or 127 UTF-16 characters. Similarly for UDF:
it could be 127 or 254 characters, depending on which compression ID
is used.
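The NTFS 255-vs-127 split comes from surrogate pairs: a BMP character costs one 16-bit unit, a character outside the BMP costs two. A small sketch (Python, for illustration) of the arithmetic:

```python
# NTFS stores a name as at most 255 16-bit (UTF-16) code units.
# BMP characters cost one unit each; characters outside the BMP cost
# two (a surrogate pair), which is how a 255-unit budget can shrink
# to roughly 127 characters.
def utf16_units(s: str) -> int:
    # utf-16-le adds no BOM, so every 2 bytes is exactly one code unit
    return len(s.encode("utf-16-le")) // 2

assert utf16_units("A" * 255) == 255           # 255 BMP chars fit exactly
assert utf16_units("\U0001F600" * 127) == 254  # 127 non-BMP chars ~ the cap
```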
Joerg Schilling wrote at 11/30/06 07:06:
Ienup Sung <[EMAIL PROTECTED]> wrote:
This would at least create incompatibilities with long filenames.
ISO-8859-1 is the low 8 bits of UNICODE, and if I use an ISO-8859-1
encoded filename, I am currently able to have up to 255 ISO-8859-1
characters in a filename.
After e.g. UFS is converted to UTF-8, the max file name length depends
on the content of a filename and may be reduced to only 127 ISO-8859-1
characters. As a result, you may be unable to restore a backup.
Jörg
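Joerg's backup concern can be checked with a small sketch (Python, for illustration): non-ASCII Latin-1 characters become two bytes each in UTF-8, so a name that fit a 255-byte limit before conversion can double in size afterwards.

```python
# 255 Latin-1 characters always fit a 255-byte limit in ISO 8859-1,
# but the non-ASCII ones become 2 bytes each in UTF-8, so the same
# name can double in size and no longer fit after conversion.
name = "é" * 255                           # worst case: all non-ASCII

assert len(name.encode("latin-1")) == 255  # fits a 255-byte limit
assert len(name.encode("utf-8")) == 510    # twice over it as UTF-8

# Under the same 255-byte cap, only 127 such characters survive:
assert len(("é" * 127).encode("utf-8")) == 254
```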
That's true, and that is why migration to a UTF-8 file system, and
switching the feature on or off, is not mandatory. Migration is also
aided by tools such as fsexam(1), with which people can do a dry run
and then customize the conversion afterwards.
As ISO-8859-1 is the low 8 bits of UNICODE, it would be possible to
store either UNICODE or ISO-8859-1 in a directory entry and to distinguish
UTF-8 from ISO-8859-1. In this case, you need to allow file names longer
than 255 bytes in lookuppn(), but this is needed anyway for Joliet, where
you would need up to 330 bytes plus a null character in order to store a
110-character file name that uses only katakana.
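The 330-byte figure follows from katakana encoding: each katakana character takes 3 bytes in UTF-8. A quick check (Python, for illustration):

```python
# Katakana characters (e.g. "カ", U+30AB) take 3 bytes each in UTF-8,
# so a 110-character all-katakana Joliet name needs 110 * 3 = 330
# bytes, plus one byte for the terminating NUL.
kata_name = "カ" * 110

assert len(kata_name) == 110                  # 110 characters
assert len(kata_name.encode("utf-8")) == 330  # 330 bytes in UTF-8
```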
Jörg
_______________________________________________
opensolaris-discuss mailing list
[email protected]