On 8/9/05, SADAHIRO Tomoyuki <[EMAIL PROTECTED]> wrote:
> Hello,
>
> On Tue, 9 Aug 2005 15:09:42 +0530, Sastry <[EMAIL PROTECTED]> wrote
> > Hi
> >
> > As suggested by you, I ran the following script which resulted in
> > substituting all the characters with X irrespective of the "special
> > case" [i-j].
> >
> > ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
> > is($a, "XXXXXXXX");
>
> Right, that behavior of ranges in character classes [ ] is to be expected
> from literal_endpoint, which was introduced by Change 16556.
>
> cf. http://public.activestate.com/cgi-bin/perlbrowse?patch=16556
>
> > I have also observed that whenever there are any gapped characters eg:
> > [r-s] as in the following script, it just translates 'r' and 's' to X
> > alone!
> >
> > ($a = "\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2") =~ tr/\x99-\xa2/X/;
> > is($a, "XXXXXXXXXX");
>
> > a) Why is it mentioned that when [i-j] is included [\x89-\x91] should
> > not be included?
> > b) Do you think there is a bug in the tr// implementation as a
> > consequence of the above?
> >
> > -Sastry
>
> Answer for a) is mentioned in perlebcdic.pod.
> The last sentence ("This works in...") seems to have been added there
> along with Change 16556 as above.
>
> +++quote begin
> REGULAR EXPRESSION DIFFERENCES
> As of perl 5.005_03 the letter range regular expression such as [A-Z]
> and [a-z] have been especially coded to not pick up gap characters.
> For example, characters such as o WITH CIRCUMFLEX that lie between I
> and J would not be matched by the regular expression range /[H-K]/.
> This works in the other direction, too, if either of the range end
> points is explicitly numeric: [\x89-\x91] will match \x8e, even though
> \x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic
> viewpoint.

If I specify [\x89-\x91], it matches only the end characters (i, j) and
does not match any of the gap characters (including \x8e), unlike what you
described. Is this correct?

-Sastry

> ----quote end
>
> I'll give some additional explanations from the viewpoint
> of portability:
> a letter range [h-k] always means [hijk], even on EBCDIC platforms,
> but not [hi\x8A-\x90jk], because the string "h" is always the small
> letter 'h' whether its code value is 0x68 or 0x88;
> thus a numeric range [\x89-\x91] should always mean
> [\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91] even on EBCDIC platforms,
> but not [\x89\x91], because the string "\x89" always stands for
> the code value 0x89 whether it encodes a certain C1 control character
> or the letter 'i'.
>
> b): In my opinion the above change in [ ] for regular expressions
> is an improvement and a similar change in tr/// is also advisable.
>
> The reason why I hesitate to use the word "bug" is based on
> the following statement on tr/// in perlop.pod, esp. the last sentence:
>
> +++quote begin
> Note also that the whole range idea is rather unportable between
> character sets--and even within character sets they may cause results
> you probably didn't expect. A sound principle is to use only ranges
> that begin from and end at either alphabets of equal case (a-e, A-E),
> or digits (0-4). Anything else is unsafe. If in doubt, spell out
> the character sets in full.
> ----quote end
>
> where numeric ranges such as \x89-\x91 are not declared
> to be safe, but to be unsafe.
>
> Regards,
> SADAHIRO Tomoyuki
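
To compare the forms discussed above side by side, here is a minimal sketch that may help when checking a given perl build. It applies the numeric range \x89-\x91 through s///, through tr///, and through a tr/// list spelled out in full (the perlop advice), and then lists which characters the letter range [h-k] matches. The variable names ($bytes, $by_regex, $by_tr, $by_list, $letters) are my own and are not taken from the test suite; this has not been run on an EBCDIC system, so it is a diagnostic aid rather than a statement of what such a platform prints.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the nine code points 0x89..0x91 numerically, so the test data
# itself does not depend on any character-range syntax.
my $bytes = join '', map { chr($_) } 0x89 .. 0x91;

# Numeric range inside a regex character class.
(my $by_regex = $bytes) =~ s/[\x89-\x91]/X/g;

# The same numeric range in tr///.
(my $by_tr = $bytes) =~ tr/\x89-\x91/X/;

# The range spelled out in full, as perlop suggests when in doubt.
(my $by_list = $bytes) =~ tr/\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91/X/;

# Letter range: every code point below 256 that [h-k] matches.
my $letters = join '', grep { /[h-k]/ } map { chr($_) } 0 .. 255;

# Count the X's rather than printing raw bytes, which may be unprintable.
printf "s/[\\x89-\\x91]/X/g replaced %d of 9\n", ($by_regex =~ tr/X//);
printf "tr/\\x89-\\x91/X/   replaced %d of 9\n", ($by_tr    =~ tr/X//);
printf "tr/ spelled out /X/ replaced %d of 9\n", ($by_list  =~ tr/X//);
printf "[h-k] matches: %s\n", $letters;
```

On an ASCII platform all three counts are 9 and [h-k] matches exactly "hijk". If the s/// and tr/// counts disagree with each other on an EBCDIC build, that is the inconsistency raised in question b); the spelled-out list does not rely on range handling at all.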