On 8/9/05, SADAHIRO Tomoyuki <[EMAIL PROTECTED]> wrote:
> Hello,
> 
> On Tue, 9 Aug 2005 15:09:42 +0530, Sastry <[EMAIL PROTECTED]> wrote
> > Hi
> >
> > As suggested by you, I ran the following script which resulted in
> > substituting all the characters with X irrespective of the "special
> > case" [i-j].
> >
> > ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
> > is($a, "XXXXXXXX");
> 
> Right, that behavior of ranges in character classes [ ] is what one
> would expect from literal_endpoint, which was introduced by Change 16556.
> 
> cf. http://public.activestate.com/cgi-bin/perlbrowse?patch=16556
> 
> > I have also observed that whenever there are any gap characters, e.g.
> > [r-s] as in the following script, it translates only 'r' and 's' to X!
> >
> > ($a = "\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2") =~ tr/\x99-\xa2/X/;
> > is($a, "XXXXXXXXXX");
> 
> > a) Why is it mentioned that when [i-j] is included [\x89-\x91] should
> > not be included?
> > b) Do you think there is a bug in the tr// implementation as a
> > consequence of the above?
> >
> > -Sastry
> 
> The answer to a) is given in perlebcdic.pod.
> The last sentence ("This works in...") seems to have been added there
> along with Change 16556, mentioned above.
> 
> +++quote begin
> REGULAR EXPRESSION DIFFERENCES
> As of perl 5.005_03 the letter range regular expression such as [A-Z]
> and [a-z] have been especially coded to not pick up gap characters.
> For example, characters such as o WITH CIRCUMFLEX that lie between I
> and J would not be matched by the regular expression range /[H-K]/.
> This works in the other direction, too, if either of the range end
> points is explicitly numeric: [\x89-\x91] will match \x8e, even though
> \x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic
> viewpoint.
If I specify [\x89-\x91], it matches only the end characters (i, j)
and none of the gap characters (including \x8e),
contrary to what is described above.
Is this correct?
-Sastry
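
One way to make the observation above reproducible is to list exactly
which code values the class matches. The following is only a minimal
sketch: the helper name matched_bytes is hypothetical (it is not part of
perl or of this thread), and the result depends on the platform and perl
version it is run on.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the code values in
# $from..$to whose single-character string matches the pattern $re.
sub matched_bytes {
    my ($re, $from, $to) = @_;
    return grep { chr($_) =~ $re } $from .. $to;
}

# Report which of the nine code values 0x89..0x91 the numeric range matches.
my @hits = matched_bytes(qr/[\x89-\x91]/, 0x89, 0x91);
printf "matched: %s\n", join ' ', map { sprintf '\\x%02x', $_ } @hits;
```

If perlebcdic.pod is accurate, all nine values should be reported; a
list containing only \x89 and \x91 would show the behavior described
above.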

> ----quote end
> 
> I'll give some additional explanations from the viewpoint
> of portability:
> a letter range [h-k] always means [hijk], even on EBCDIC platforms,
> but not [hi\x8A-\x90jk], because the string "h" is always the small
> letter 'h' whether its code value is 0x68 or 0x88;
> thus a numeric range [\x89-\x91] should always mean
> [\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91] even on EBCDIC platforms,
> but not [\x89\x91], because the string "\x89" always stands for
> the code value 0x89 whether it encodes a certain C1 control character
> or the letter 'i'.
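
The two portability claims above can be written down as a small check.
This is a sketch only: class_members is a hypothetical helper name, and
whether both lines are printed on a given EBCDIC build is exactly the
question under discussion.

```perl
use strict;
use warnings;

# Hypothetical helper: the code values 0..255 matched by $re,
# joined into one comma-separated string so two classes can be compared.
sub class_members {
    my ($re) = @_;
    return join ',', grep { chr($_) =~ $re } 0 .. 255;
}

# Claim 1: a letter range means exactly the letters, with no gap characters.
print "letter range ok\n"
    if class_members(qr/[h-k]/) eq class_members(qr/[hijk]/);

# Claim 2: a numeric range means every code value between its end points.
print "numeric range ok\n"
    if class_members(qr/[\x89-\x91]/) eq join(',', 0x89 .. 0x91);
```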
> 
> b): In my opinion the above change in [  ] for regular expressions
> is an improvement and a similar change in tr/// is also advisable.
> 
> The reason why I hesitate to use the word "bug" is based on
> the following statement on tr/// in perlop.pod, esp. the last sentence:
> 
> +++quote begin
> Note also that the whole range idea is rather unportable between
> character sets--and even within character sets they may cause results
> you probably didn't expect. A sound principle is to use only ranges
> that begin from and end at either alphabets of equal case (a-e, A-E),
> or digits (0-4). Anything else is unsafe. If in doubt, spell out
> the character sets in full.
> ----quote end
> 
> Under that rule, numeric ranges such as \x89-\x91 are not among the
> ranges declared safe; they fall under "anything else" and so are unsafe.
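
Following the perlop advice to "spell out the character sets in full",
a tr/// that does not rely on a numeric range could look like the sketch
below. The helper name x_out_spelled is hypothetical; the nine code
values are the ones from the \x89-\x91 example.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each of the code values 0x89..0x91 with 'X',
# listing every value explicitly instead of writing the range \x89-\x91.
sub x_out_spelled {
    my ($s) = @_;
    $s =~ tr/\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91/X/;
    return $s;
}

# \x88 and \x92 lie outside the list and must be left untouched.
my $out = x_out_spelled("\x88\x89\x8e\x91\x92");
print(($out eq "\x88XXX\x92") ? "ok\n" : "not ok\n");
```

Because every code value is listed, the result does not depend on how
tr/// interprets a range on the platform in question.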
> 
> Regards,
> SADAHIRO Tomoyuki
> 
> 
>
