On Thu, Apr 14, 2011 at 13:04, William A. Rowe Jr. <[email protected]> wrote:
> With some multibyte character sets, it may be possible that '/' is one
> byte of a multibyte sequence. From a Unix perspective, I presume that
> it is always treated as a path separator and never treated as a multibyte
> combination filename character.
>
> But I just wanted to ask in case anyone is aware of where this might be
> treated as a valid filename character?

Wikipedia on Shift-JIS (http://en.wikipedia.org/wiki/Shift_JIS) says:

*Shift JIS* (also *SJIS*, MIME name *Shift_JIS*) is a character encoding for the Japanese language originally developed by a Japanese company called ASCII Corporation in conjunction with Microsoft and standardized as *JIS X 0208 Appendix 1*. It is based on character sets defined within JIS standards JIS X 0201:1997 (for the single-byte characters) and JIS X 0208:1997 (for the double byte characters). The lead bytes for the double byte characters are "shifted" around the 64 halfwidth katakana characters in the single-byte range 0xA1 to 0xDF. The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign at 0x5C and an overline at 0x7E in place of the ASCII character set's backslash and tilde respectively. The single-byte characters from 0xA1 to 0xDF map to the half-width katakana characters found in JIS X 0201.
Shift JIS requires an 8-bit clean medium for transmission. It is fully backwards compatible with the legacy JIS X 0201 single-byte encoding, meaning it supports half-width katakana and that any valid JIS X 0201 string is also a valid Shift JIS string. For two-byte characters, however, Shift JIS only guarantees that the first byte will be high bit set (0x80–0xFF); the value of the second byte can be either high or low. Appearance of byte values 0x40–0x7E as second bytes of code words makes reliable Shift JIS detection difficult, because same codes are used for ASCII characters. On the other hand, the competing 8-bit format EUC-JP, which does not support single-byte halfwidth katakana, allows for a much cleaner and direct conversion to and from JIS X 0208 code points, as all high bit set bytes are parts of a double-byte character and all codes from ASCII range represent single-byte characters.

Given that the second byte is either high-bit-set or in the range 0x40..0x7E (second para), and / is 0x2F, there shouldn't be a problem with Shift-JIS. That's not to say there isn't another codeset where there is a problem, but I don't think it is Shift-JIS and possibly not any of the main Japanese codesets.

--
Jonathan Leffler <[email protected]> #include <disclaimer.h>
Guardian of DBD::Informix - v2008.0513 - http://dbi.perl.org
"Blessed are we who can laugh at ourselves, for we shall never cease to be amused."
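
For anyone who wants to check the byte-range argument rather than take it on trust, here is a minimal Python sketch. It brute-forces every two-byte sequence with a high-bit lead byte and records which second bytes occur in a genuine double-byte character. The function name `trail_bytes` is made up for this illustration, and the result reflects Python's bundled codecs, which may differ in detail from the JIS standards or from vendor variants of Shift JIS.

```python
def trail_bytes(encoding):
    """Collect every second byte seen in a valid two-byte character of `encoding`.

    Hypothetical helper: it relies on Python's own codec tables, so it is
    only as accurate as those tables are for the codeset in question.
    """
    seen = set()
    for lead in range(0x80, 0x100):
        for trail in range(0x100):
            try:
                text = bytes([lead, trail]).decode(encoding)
            except UnicodeDecodeError:
                continue  # not a valid sequence in this codeset
            # Keep only pairs that decode to ONE character; a result of two
            # characters means the bytes were two single-byte characters.
            if len(text) == 1:
                seen.add(trail)
    return seen


sjis = trail_bytes("shift_jis")
eucjp = trail_bytes("euc_jp")

print("0x2F ('/') as Shift JIS second byte:", 0x2F in sjis)
print("0x5C as Shift JIS second byte:", 0x5C in sjis)
print("lowest Shift JIS second byte: 0x%02X" % min(sjis))
print("lowest EUC-JP second byte: 0x%02X" % min(eucjp))
```

The same function can be pointed at other codec names that Python ships (for example "big5" or "gbk") to look for a codeset where 0x2F does turn up as a second byte. It only examines two-byte sequences, so it says nothing about longer multibyte forms.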
