I'm somewhat confused about what this enhancement asks for. Here are a few points that may need to be clarified.

1. Would different byte orders need to be accommodated? The data may be little or big endian, and always assuming one of them might not be appropriate. Picking up a value with *(unsigned short int *) could yield the wrong interpretation.

   ME: I think PCRE should only handle the native endianness of the machine that it's compiled on. x86 happens to be little endian. If any cross-endian support is needed, it should be left to the application program to swap the data byte pairs.

2. Would there be consideration for data that begins with a UTF-16 byte order mark? When present, it indicates the endianness of the data: "\xFF\xFE" (little endian), "\xFE\xFF" (big endian). Would that be skipped over and ignored, treated as data, or actually used by the PCRE library to switch its treatment of the byte order?

   ME: I think PCRE should treat a byte order mark as data and not even try to detect it. Where byte order marks exist, it should be the application program's responsibility to accommodate them.

3. Would it be acceptable if the search pattern given to PCRE is always ASCII or UTF-8? Does PCRE need to accommodate UTF-16 in the search pattern too? The desire is to support UTF-16, but how well can that be done without UTF-16 in the search pattern?

   I think PCRE's use of a UTF-16 (in native endian) search pattern may be what is being requested, but I'm not sure. There may be some assumed functionality that needs to be clarified.

   Instead of an ASCII pattern such as "\sabc[def]", someone could manipulate it to be "\x00\s\x00a\x00b\x00c\x00[def]". There, is that enough "support for UTF-16"? Probably not? If it were, there could be discussion of pattern alteration schemes that don't overhaul the underlying engines, and then perhaps optimization enhancements to speed up searching with such patterns, if that was appropriate.

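To make points 1 and 2 concrete, here is a minimal sketch of what an application program (not PCRE) might do. All of the names below are hypothetical helpers made up for illustration; none of them exist in PCRE. It shows how the same two bytes yield different values depending on the assumed byte order, and how a UTF-16 byte order mark could be detected by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration only; not part of PCRE. */

enum bom_kind { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE };

/* Report which UTF-16 byte order mark, if any, starts the buffer. */
static enum bom_kind detect_utf16_bom(const unsigned char *p, size_t len)
{
    assert(p != NULL || len == 0);
    if (len >= 2) {
        if (p[0] == 0xFF && p[1] == 0xFE) return BOM_UTF16_LE;
        if (p[0] == 0xFE && p[1] == 0xFF) return BOM_UTF16_BE;
    }
    return BOM_NONE;
}

/* Read one 16-bit unit in whatever byte order this machine uses.
   memcpy sidesteps the alignment trouble a pointer cast can cause. */
static uint16_t read_native(const unsigned char *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Read one 16-bit unit with an explicit byte order, independent of the host. */
static uint16_t read_le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint16_t read_be(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

For the bytes 0x61 0x00, read_le() gives 0x0061 (the letter "a") while read_be() gives 0x6100, and read_native() gives whichever of the two matches the host. That is the misinterpretation I mean in point 1.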
A module concerned with translating a UTF-16 search pattern into something that can search UTF-16 data using the current PCRE engine should also accommodate big vs. little endian in the resultant pattern.

Matching of UTF-16 data probably(?) involves the ability to specify search arguments that contain native UTF-16 code points. Yet for that to happen, PCRE would have to parse a UTF-16 "string" to identify the regular expression syntax and semantics. The compile() process may need to be told whether it's a compilation for ASCII/UTF-8 or one for UTF-16. You might also want it to be told the length of the pattern, since strlen() would misinterpret the zero bytes that are common in UTF-16 characters as the end of the string.

Let's consider pcregrep as being an application program. With UTF-16 support in the PCRE base library, that program should be able to:

+ Look for a UTF-16 byte order mark at the start of a file. If there is none, proceed without UTF-16 involvement. It might also test for (and perhaps skip over to ignore) the UTF-8 byte order mark "\xEF\xBB\xBF" (3 bytes).

+ If it detects a UTF-16 byte order mark that differs from the native byte order of the build, the program should swap all data byte pairs read from that file.

+ But what about the pcregrep command-line parameters? Should the search pattern be specified as UTF-8 and converted to UTF-16 for possible use in scanning files that have a UTF-16 byte order mark? Does supporting UTF-16 in pcregrep involve compiling twice: once treating the pattern as ASCII, and once with a UTF-16 pattern for the cases where UTF-16 content is detected?

The people asking for UTF-16 support should be more precise about what is being requested and what their expectations would be. There may be issues related to PCRE compile() as well as exec(). Or perhaps the implementation considerations should be more focused on UTF-16 search patterns rather than methods of traversing UTF-16 data.

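As a rough sketch of two of the application-side chores mentioned above, assuming the application does the work itself: swapping byte pairs when a file's byte order differs from the native one, and measuring a UTF-16 pattern in 16-bit units rather than with strlen(). The function names are hypothetical and are not part of PCRE or pcregrep.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration only; not part of PCRE or pcregrep. */

/* Swap every byte pair in place. A trailing odd byte is left untouched. */
static void swap_byte_pairs(unsigned char *buf, size_t len)
{
    size_t i;
    assert(buf != NULL || len == 0);
    for (i = 0; i + 1 < len; i += 2) {
        unsigned char t = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = t;
    }
}

/* Count 16-bit units up to (not including) a terminating zero unit.
   This is the UTF-16 counterpart of strlen(). */
static size_t utf16_unit_count(const uint16_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}
```

For the pattern "abc" held as native-endian 16-bit units, utf16_unit_count() returns 3, while strlen() on the same bytes stops at the first zero byte and returns 1 on a little-endian machine or 0 on a big-endian one. That is why compile() would probably need an explicit length for a UTF-16 pattern.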
References:
  UTF-8   http://www.ietf.org/rfc/rfc3629.txt
  UTF-16  http://www.ietf.org/rfc/rfc2781.txt

PS - I'm a disinterested party other than the potential impact to the base PCRE library.

Regards,
Graycode
