PCRE for classic z/OS is now stable and development effort is basically over.
Before I begin on PCRE2, I would like to tackle the test suite in a more
serious manner regarding the EBCDIC vs. ASCII environment. From previous
communication I know that for 8-bit code I need only TESTIN1, TESTIN2,
TESTIN11 and TESTIN12. I ran those tests as is, knowing that the results would
be different from the results on an ASCII platform. Indeed, there were some
expected differences, but also some that were unexpected, or at least that I do
not understand. I will have to ask questions as I scan and try to make sense of
those differences.

Two comments before my questions:
* The logical-not character ¬ is the EBCDIC equivalent of the circumflex ^.
* I am somewhat surprised that most tests actually produced the same results.

1. On TESTOUT11-8, line 284, you have:

    /\x{100}/8BM
    Memory allocation (code space): 10
    ------------------------------------------------------------------
      0   6 Bra
      3     \x{100}
      6   6 Ket
      9     End
    ------------------------------------------------------------------

    /\x{1000}/8BM
    Memory allocation (code space): 11
    ------------------------------------------------------------------
      0   7 Bra
      3     \x{1000}
      7   7 Ket
     10     End
    ------------------------------------------------------------------

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
I get:

    /\x{100}/8BM
    Failed: this version of PCRE is compiled without UTF support at offset 0

    /\x{1000}/8BM
    Failed: this version of PCRE is compiled without UTF support at offset 0

/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/

2. I guess that the tests with saved patterns for 16 and 32 bits are not really
relevant, because nobody could produce their equivalent on EBCDIC platforms.

3. On TESTOUT2, line 518, you have:

    /(?i)[abcd]/IS
    Capturing subpattern count = 0
    Options: caseless
    No first char
    No need char
    Subject length lower bound = 1
    Starting chars: A B C D a b c d

while I have:

    /(?i)[abcd]/IS
    Capturing subpattern count = 0
    Options: caseless
    No first char
    No need char
    Subject length lower bound = 1
    Starting chars: a b c d A B C D

The small and capital letters switch places. Now, in EBCDIC, the capital
letters appear after the small letters (e.g. A is 0xC1 while a is 0x81), while
in ASCII the opposite is true. Would that cause the difference? Should it be
like that?
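
The ordering question can be checked outside PCRE. What follows is only a
minimal sketch, not pcretest's actual code: it assumes that the "Starting
chars" list is printed by walking the possible byte values in ascending order,
and it uses Python's cp037 codec as an assumed stand-in for the EBCDIC code
page on z/OS. The helper name starting_chars is hypothetical.

```python
def starting_chars(chars, encoding):
    """List chars in ascending byte-value order under the given encoding.

    Hypothetical helper: mimics walking a start-byte table from 0 to 255
    and printing every character whose entry is set.
    """
    by_code = sorted(chars, key=lambda c: c.encode(encoding)[0])
    return " ".join(by_code)


letters = "abcdABCD"
print("ASCII :", starting_chars(letters, "ascii"))
print("EBCDIC:", starting_chars(letters, "cp037"))
```

Under that assumption the two orderings come out the same way as the two
outputs quoted above, which would make the difference a consequence of the
code page rather than a fault.
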

4. I am a bit concerned whether we have the \n defined correctly. In
TESTOUT2, line 689, you have:

    /(?<=foo\n)¬bar/Im
    Capturing subpattern count = 0
    Max lookbehind = 4
    Contains explicit CR or LF match
    Options: multiline
    No first char
    Need char = 'r'
        foo\nbarbar
     0: bar

while I have:

    /(?<=foo\n)¬bar/Im
    Capturing subpattern count = 0
    Max lookbehind = 4
    Contains explicit CR or LF match
    Options: multiline
    No first char
    Need char = 'r'
        foo\nbarbar
    No match

Similarly, on TESTOUT2, line 1098, you have:

    /word ((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+)((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+)?)?)?)?)?)?)?)?)?otherword/I
    Capturing subpattern count = 8
    Contains explicit CR or LF match
    No options
    First char = 'w'
    Need char = 'd'

while I have:

    /word ((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+)((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+)?)?)?)?)?)?)?)?)?otherword/I
    Capturing subpattern count = 8
    No options
    First char = 'w'
    Need char = 'd'
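
For the \n question, one thing that can be checked independently is which byte
values are candidates for "newline" in EBCDIC. The sketch below is an
illustration only and says nothing about how this PCRE build is configured. It
uses Python's cp037 codec as an assumed stand-in for the z/OS code page, and
the helper name ebcdic_byte is hypothetical.

```python
ENCODING = "cp037"  # assumption: stand-in for the EBCDIC code page in use


def ebcdic_byte(ch):
    """Return the single byte value that ch maps to under ENCODING."""
    return ch.encode(ENCODING)[0]


lf = ebcdic_byte("\n")     # LF,  U+000A
cr = ebcdic_byte("\r")     # CR,  U+000D
nel = ebcdic_byte("\x85")  # NEL, U+0085

print(f"LF  -> 0x{lf:02X}")
print(f"CR  -> 0x{cr:02X}")
print(f"NEL -> 0x{nel:02X}")
```

In this codec LF and NEL land on two different bytes (0x25 and 0x15). If the
test data and the library disagree about which of the two \n means, a
lookbehind such as (?<=foo\n) could fail in the way shown above; whether that
is what happens here is only a guess.
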
Ze'ev Atlas
--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev