Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-10 Thread Sastry
On 8/9/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
 Hello,
 
 On Tue, 9 Aug 2005 15:09:42 +0530, Sastry [EMAIL PROTECTED] wrote
  Hi
 
  As suggested by you, I ran the following script which resulted in
  substituting all the characters with X irrespective of the special
  case [i-j].
 
  ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
  is($a, );
 
 Right, that behavior of ranges in character classes [ ] is what one
 would expect from literal_endpoint, which was introduced by Change 16556.
 
 cf. http://public.activestate.com/cgi-bin/perlbrowse?patch=16556
 
  I have also observed that whenever there are any gapped characters eg:
  [r-s] as in the following script, it just translates 'r' and 's' to X
  alone!
 
  ($a = "\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2") =~ tr/\x99-\xa2/X/;
  is($a, XX);
 
  a) Why is it mentioned that when [i-j] is included [\x89-\x91] should
  not be included?
  b) Do you think there is a bug in the tr// implementation as a
  consequence of the above?
 
  -Sastry
 
 The answer to a) is given in perlebcdic.pod.
 The last sentence (This works in...) seems to have been added there
 together with Change 16556 mentioned above.
 
 +++quote begin
 REGULAR EXPRESSION DIFFERENCES
 As of perl 5.005_03 the letter range regular expression such as [A-Z]
 and [a-z] have been especially coded to not pick up gap characters.
 For example, characters such as o WITH CIRCUMFLEX that lie between I
 and J would not be matched by the regular expression range /[H-K]/.
 This works in the other direction, too, if either of the range end
 points is explicitly numeric: [\x89-\x91] will match \x8e, even though
 \x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic
 viewpoint.
If I specify [\x89-\x91], it matches only the end characters (i, j)
and none of the gap characters (including \x8e),
unlike what you described.
Is this correct?
-Sastry

 quote end
 
 I'll give some additional explanations from the viewpoint
 of portability:
 a letter range [h-k] always means [hijk], even on EBCDIC platforms,
 and not [hi\x8A-\x90jk], because the string "h" is always the small
 letter 'h' whether its code value is 0x68 or 0x88;
 likewise, a numeric range [\x89-\x91] should always mean
 [\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91], even on EBCDIC platforms,
 and not [\x89\x91], because the string "\x89" always stands for
 the code value 0x89 whether it encodes a certain C1 control character
 or the letter 'i'.
 
 b): In my opinion the above change in [  ] for regular expressions
 is an improvement and a similar change in tr/// is also advisable.
 
 The reason why I hesitate to use the word bug is based on
 the following statement on tr/// in perlop.pod, esp. the last sentence:
 
 +++quote begin
 Note also that the whole range idea is rather unportable between
 character sets--and even within character sets they may cause results
 you probably didn't expect. A sound principle is to use only ranges
 that begin from and end at either alphabets of equal case (a-e, A-E),
 or digits (0-4). Anything else is unsafe. If in doubt, spell out
 the character sets in full.
 quote end
 
 By that statement, numeric ranges such as \x89-\x91 are not among
 the ranges declared safe, so they count as unsafe.
 
 Regards,
 SADAHIRO Tomoyuki
 
 



Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-09 Thread SADAHIRO Tomoyuki

On Mon, 8 Aug 2005 15:36:40 +0100, Nicholas Clark [EMAIL PROTECTED] wrote

 On Thu, Aug 04, 2005 at 11:42:54AM +0530, Sastry wrote:
  Hi
  
  I am trying to run this script on an EBCDIC platform using perl-5.8.6
   
  ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
  is($a, );
  
  
  The result I get is 
  
   'X«»ðý±°X'
  
  a) Is this happening  since \x8a\x8b\x8c\x8d\x8f\x90 are the gapped
  characters in EBCDIC ?
 
 I think so. In that \x89 is 'i' and \x91 is 'j'.
 
 
  b) Should all the bytes in $a change to X?
 
 I don't know. It seems to be some special case code in regexec.c:
 
 #ifdef EBCDIC
     /* In EBCDIC [\x89-\x91] should include
      * the \x8e but [i-j] should not. */
     if (literal_endpoint == 2 &&
         ((isLOWER(prevvalue) && isLOWER(ceilvalue)) ||
          (isUPPER(prevvalue) && isUPPER(ceilvalue))))
     {
         if (isLOWER(prevvalue)) {
             for (i = prevvalue; i <= ceilvalue; i++)
                 if (isLOWER(i))
                     ANYOF_BITMAP_SET(ret, i);
         } else {
             for (i = prevvalue; i <= ceilvalue; i++)
                 if (isUPPER(i))
                     ANYOF_BITMAP_SET(ret, i);
         }
     }
     else
 #endif
 
 
 which I assume is making [i-j] in a regexp leave a gap, but [\x89-\x91] not.
 I don't know where ranges in tr/// are parsed, but given that I grepped
 for EBCDIC and didn't find any analogous code, it looks like tr/\x89-\x91//
 is treated as tr/i-j// and in turn i-j is treated as letters and always
 special-cased.

S_scan_const() in toke.c seems to expand ranges in tr///,
while S_regclass() in regcomp.c (what I assume you mean) copes
with those in []. 

 from toke.c, line 1419
#ifdef EBCDIC
    if ((isLOWER(min) && isLOWER(max)) ||
        (isUPPER(min) && isUPPER(max))) {
        if (isLOWER(min)) {
            for (i = min; i <= max; i++)
                if (isLOWER(i))
                    *d++ = NATIVE_TO_NEED(has_utf8,i);
        } else {
            for (i = min; i <= max; i++)
                if (isUPPER(i))
                    *d++ = NATIVE_TO_NEED(has_utf8,i);
        }
    }
    else
#endif

The former has nothing like the literal_endpoint check in the latter;
thus tr/// does not seem to tell literal characters from escapes such
as \x89 in ranges, and tr/\x89-\x91/X/ will not replace \x8e on EBCDIC.
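
To make the consequence concrete, here is a small C simulation that
can be compiled on any host. It is a sketch under stated assumptions,
not Perl's code: expand_tr_range and translate are hypothetical names,
the two case tests assume the usual EBCDIC letter positions, and 0xE7
is taken as the EBCDIC code for 'X'. The expansion follows the shape
of the toke.c branch quoted above, which filters by case whenever both
endpoints are letters of the same case, however they were written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for isLOWER/isUPPER on an EBCDIC build. */
int ebcdic_is_lower(unsigned int c)
{
    return (c >= 0x81 && c <= 0x89) || (c >= 0x91 && c <= 0x99) ||
           (c >= 0xA2 && c <= 0xA9);
}

int ebcdic_is_upper(unsigned int c)
{
    return (c >= 0xC1 && c <= 0xC9) || (c >= 0xD1 && c <= 0xD9) ||
           (c >= 0xE2 && c <= 0xE9);
}

/* Expand min-max into list[] with no literal_endpoint test at all.
 * Returns the number of bytes written. */
size_t expand_tr_range(unsigned int min, unsigned int max,
                       unsigned char list[256])
{
    size_t n = 0;
    unsigned int i;
    int lower = ebcdic_is_lower(min) && ebcdic_is_lower(max);
    int upper = ebcdic_is_upper(min) && ebcdic_is_upper(max);

    assert(min <= max && max < 256);
    for (i = min; i <= max; i++) {
        if (lower && !ebcdic_is_lower(i))
            continue;               /* gap character dropped */
        if (upper && !ebcdic_is_upper(i))
            continue;
        list[n++] = (unsigned char)i;
    }
    return n;
}

/* Replace every byte of buf that occurs in list with repl and return
 * the number of replacements, roughly what tr/.../X/ does here. */
size_t translate(unsigned char *buf, size_t len,
                 const unsigned char *list, size_t n, unsigned char repl)
{
    size_t i, j, hits = 0;

    for (i = 0; i < len; i++) {
        for (j = 0; j < n; j++) {
            if (buf[i] == list[j]) {
                buf[i] = repl;
                hits++;
                break;
            }
        }
    }
    return hits;
}
```

Under these assumptions the range 0x89-0x91 expands to just 0x89 and
0x91, so only the first and last bytes of Sastry's test string are
replaced, which agrees with the 'X......X' shape of the result he
reported.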

Hmm, this may be an inconsistency in the EBCDIC case.
Sastry, would you please run the following snippet on your EBCDIC platform?

($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
 is($a, );

Does that work similarly to yours?
($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
 is($a, );

Regards,
SADAHIRO Tomoyuki