Re: Q to unix filesystem developers

Jonathan Leffler Thu, 14 Apr 2011 16:01:05 -0700

On Thu, Apr 14, 2011 at 13:04, William A. Rowe Jr. <[email protected]> wrote:


> With some multibyte character sets, it may be possible that '/' is one
> byte of a multibyte sequence.  From a Unix perspective, I presume that
> it is always treated as a path separator and never treated as a multibyte
> combination filename character.
>
> But I just wanted to ask in case anyone is aware of where this might
> be treated as a valid filename character?
>


Wikipedia on Shift-JIS (http://en.wikipedia.org/wiki/Shift_JIS) says:

*Shift JIS* (also *SJIS*, MIME <http://en.wikipedia.org/wiki/MIME> name
*Shift_JIS*) is a character encoding
<http://en.wikipedia.org/wiki/Character_encoding> for the Japanese language
<http://en.wikipedia.org/wiki/Japanese_language> originally developed by a
Japanese <http://en.wikipedia.org/wiki/Japan> company called ASCII
Corporation <http://en.wikipedia.org/wiki/ASCII_%28company%29> in
conjunction with Microsoft <http://en.wikipedia.org/wiki/Microsoft> and
standardized as *JIS X 0208 Appendix 1*. It is based on character sets
defined within JIS
<http://en.wikipedia.org/wiki/Japanese_Industrial_Standards> standards
JIS X 0201 <http://en.wikipedia.org/wiki/JIS_X_0201>:1997 (for the single-byte
characters) and JIS X 0208 <http://en.wikipedia.org/wiki/JIS_X_0208>:1997
(for the double byte characters). The lead bytes for the double byte
characters are "shifted" around the 64 halfwidth katakana
<http://en.wikipedia.org/wiki/Katakana> characters in the single-byte range
0xA1 to 0xDF <http://en.wikipedia.org/wiki/JIS_X_0201#Encoded_Katakana>. The
single-byte characters 0x <http://en.wikipedia.org/wiki/0x>00 to 0x7F match
the ASCII <http://en.wikipedia.org/wiki/ASCII> encoding, except for a yen
<http://en.wikipedia.org/wiki/Japanese_yen> sign at 0x5C and an overline at
0x7E in place of the ASCII character set's backslash and tilde respectively.
The single-byte characters from 0xA1 to 0xDF map to the half-width katakana
characters found in JIS X 0201.

Shift JIS requires an 8-bit clean <http://en.wikipedia.org/wiki/8-bit_clean>
medium for transmission. It is fully backwards compatible
<http://en.wikipedia.org/wiki/Backward_compatibility> with the legacy
JIS X 0201 <http://en.wikipedia.org/wiki/JIS_X_0201> single-byte encoding
<http://en.wikipedia.org/wiki/Single-byte_encoding>, meaning it supports
half-width katakana <http://en.wikipedia.org/wiki/Half-width_katakana> and
that any valid JIS X 0201 string is also a valid Shift JIS string. For
two-byte characters, however, Shift JIS only guarantees that the first byte
will be high bit set (0x80–0xFF); the value of the second byte can be either
high or low. Appearance of byte values 0x40–0x7E as second bytes of code
words <http://en.wikipedia.org/wiki/Code_word> makes reliable Shift JIS
detection difficult, because same codes are used for ASCII characters. On
the other hand, the competing 8-bit format EUC-JP
<http://en.wikipedia.org/wiki/Extended_Unix_Code#EUC-JP>, which does not
support single-byte halfwidth katakana, allows for a much cleaner and direct
conversion to and from JIS X 0208 code points
<http://en.wikipedia.org/wiki/Code_point>, as all high bit set bytes are
parts of a double-byte character and all codes from ASCII range represent
single-byte characters.

Given that a second byte with the high bit clear always falls in the range
0x40..0x7E (second para), and / is 0x2F, a slash byte can never appear inside
a two-byte Shift-JIS character, so there shouldn't be a problem with
Shift-JIS.  That's not to say there isn't another codeset where there is a
problem, but I don't think it is Shift-JIS, and possibly not any of the main
Japanese codesets.
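
If anyone wants to check this empirically rather than trust the byte ranges,
here is a minimal sketch in Python.  It assumes that Python's bundled
"shift_jis" and "euc_jp" codecs are a fair stand-in for the codesets a
filesystem would actually see, and it only walks the Basic Multilingual
Plane.  The function name and the SLASH constant are made up for
illustration; they are not from any library.

```python
SLASH = 0x2F  # the byte value of '/' in ASCII


def multibyte_chars_containing_slash(codec):
    """Return (code point, encoded bytes) pairs whose multibyte
    encoding in `codec` contains the byte 0x2F."""
    hits = []
    for cp in range(0x80, 0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters; skip them
        try:
            encoded = chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue  # character does not exist in this codeset
        # Only multibyte sequences are interesting: a single 0x2F byte
        # is just '/' itself.
        if len(encoded) > 1 and SLASH in encoded:
            hits.append((cp, encoded))
    return hits


if __name__ == "__main__":
    for codec in ("shift_jis", "euc_jp"):
        hits = multibyte_chars_containing_slash(codec)
        print(codec, "multibyte sequences containing 0x2F:", len(hits))
```

The same loop can be pointed at any other codec name Python knows about,
which is a cheap way to screen a codeset before worrying about it.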


-- 
Jonathan Leffler <[email protected]>  #include <disclaimer.h>
Guardian of DBD::Informix - v2008.0513 - http://dbi.perl.org
"Blessed are we who can laugh at ourselves, for we shall never cease to be
amused."
