http://bugs.exim.org/show_bug.cgi?id=791

           Summary: UTF-8 support does not work on EBCDIC platforms
           Product: PCRE
           Version: 7.8
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Code
        AssignedTo: [email protected]
        ReportedBy: [email protected]
                CC: [email protected]

We use PCRE 7.8 on z/OS (formerly known as OS/390 and MVS), which uses EBCDIC as its character encoding. We pass only UTF-8 strings to PCRE, with the PCRE_UTF8 flag set, both as patterns and as subject strings, because our internal string class is based on UTF-8. We configured PCRE with both --enable-utf8 and --enable-ebcdic, and also created the appropriate EBCDIC character table with dftables.

Tests showed that compiling UTF-8 patterns failed. The pattern we used resulted in an 'unmatched parentheses' error, but the exact error does not matter. The pattern was of course well-formed and worked on non-EBCDIC platforms.

I dug deeper into the PCRE source, and I think I found the cause of the problem. PCRE processes the pattern either by reading the characters directly from the pattern buffer, as compile_branch() does (the function I analyzed), or by using the GETCHAR macros from pcre_internal.h. No matter how a character is retrieved, it is always in the same codepage as the pattern passed to pcre_compile(). In our case this is UTF-8.

The real problem is that the PCRE code compares the characters taken from the pattern with normal C character literals such as '\\', '*', etc. This works fine on non-EBCDIC (ASCII) platforms because the 7-bit ASCII subset of UTF-8 is identical to ASCII. On an EBCDIC platform such character literals will never match UTF-8 characters, because the characters in the "ASCII subset" of UTF-8 have different codes in EBCDIC. For example, the backslash has code 0x5C in ASCII and UTF-8 but 0xE0 or 0xEC in EBCDIC, depending on the EBCDIC codepage, so the C compiler generates code that compares the retrieved UTF-8 character with 0xE0 or 0xEC.
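To make the mismatch concrete, here is a minimal sketch in C. It is not PCRE code: count_backslashes() and utf8_pattern are hypothetical names invented for illustration, and the literal's compiled-in value is passed as a parameter so that the effect can be reproduced on any platform. The value 0xE0 is one of the two EBCDIC backslash codes mentioned above.

```c
#include <stddef.h>

/* Codes a C compiler would substitute for the literal '\\'. */
enum {
    BACKSLASH_ASCII_UTF8 = 0x5C, /* ASCII and UTF-8 */
    BACKSLASH_EBCDIC     = 0xE0  /* one EBCDIC value cited in the report */
};

/* Hypothetical helper (not part of PCRE): counts the bytes in `pattern`
   that equal `literal_code`, mimicking a scanner loop that performs
   `if (*ptr == '\\')` with the literal already resolved by the compiler. */
static size_t count_backslashes(const unsigned char *pattern, size_t len,
                                unsigned char literal_code)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (pattern[i] == literal_code)
            n++;
    }
    return n;
}

/* The pattern a\d\w as UTF-8 bytes, written numerically so the example
   does not depend on the compiler's own execution character set. */
static const unsigned char utf8_pattern[] = { 0x61, 0x5C, 0x64, 0x5C, 0x77 };
```

With BACKSLASH_ASCII_UTF8 the helper finds both backslashes in utf8_pattern; with BACKSLASH_EBCDIC it finds none, which is how a well-formed pattern can end up being parsed incorrectly.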

I see three possible solutions to this problem:

- On EBCDIC platforms, do not use C character literals in the PCRE code. Instead, use lookup tables to get the appropriate character code: one table for non-UTF-8 (EBCDIC) mode and another for UTF-8 mode. The tables would return either the EBCDIC code or the UTF-8 code of a character. They would have to be user-definable because, as mentioned above, some meta characters such as the backslash have different codes in different EBCDIC codepages.

- Convert the pattern characters to EBCDIC as they are retrieved with one of the GETCHAR macros, so that they can be compared to C character literals. This would work for the meta characters, but what about other characters in the pattern that can be represented only in Unicode and not in EBCDIC?

- Convert the entire pattern from UTF-8 to EBCDIC before compiling it. This is clearly undesirable because it would negate all advantages of using UTF-8. I do not know PCRE well enough to say whether the subsequent matching would even work in this case.

I am writing this bug report mainly to get input on how best to solve the problem. I am aware that you probably do not have access to an EBCDIC platform and that I would have to code a fix myself.

BTW, Bugzilla does not offer appropriate choices for Hardware (zSeries) and OS (z/OS), so I had to use wrong ones.
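
For discussion, the first option (lookup tables instead of character literals) might look roughly like the sketch below. All names here (meta_char, meta_utf8, meta_ebcdic, meta_code, is_meta) are hypothetical and do not exist in PCRE. The EBCDIC values are assumed to be those of code page 037; the table is deliberately left non-const because it would have to be user-definable.

```c
/* Hypothetical indices for a few meta characters; not part of PCRE. */
enum meta_char { MC_BACKSLASH, MC_STAR, MC_LPAREN, MC_RPAREN, MC_COUNT };

/* UTF-8 mode: the codes are the same as in ASCII. */
static const unsigned char meta_utf8[MC_COUNT] = {
    0x5C, /* backslash */
    0x2A, /* *         */
    0x28, /* (         */
    0x29  /* )         */
};

/* EBCDIC mode: assumed code page 037 values. Left writable so that a
   site using another codepage (e.g. backslash at 0xEC) can override it. */
static unsigned char meta_ebcdic[MC_COUNT] = {
    0xE0, /* backslash */
    0x5C, /* *         */
    0x4D, /* (         */
    0x5D  /* )         */
};

/* Returns the code of a meta character for the current mode. */
static unsigned char meta_code(enum meta_char mc, int utf8_mode)
{
    return utf8_mode ? meta_utf8[mc] : meta_ebcdic[mc];
}

/* Replaces comparisons such as `c == '\\'` in the pattern scanner. */
static int is_meta(unsigned char c, enum meta_char mc, int utf8_mode)
{
    return c == meta_code(mc, utf8_mode);
}
```

Note that the byte 0x5C is a backslash in UTF-8 mode but an asterisk in the assumed EBCDIC table, so is_meta(0x5C, MC_BACKSLASH, 1) and is_meta(0x5C, MC_STAR, 0) are both true. The cost of this approach is an extra table lookup for every comparison that currently uses a compile-time constant.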
