Hi Philip,

I consider that to be a major issue for non-European languages, at least
those that distinguish between the basic consonant letters and additional
characters that may add information but are not necessary. (I gave you two
scripts in which such characters would be implied whether they are there or
not.)

I know that you abide by the Perl 'standard', so I guess my question would
be: where do I go to propose such a change? Let's say something like this
(similar to /i for ignore case in Perl):

    if ($x =~ /some elaborate pattern with Unicode characters/D{UNICODE CLASS NAME})  # or D{list of code points}

Here D stands for 'Disregard', and the modifier would mean: match the pattern
/some elaborate pattern with Unicode characters/, but whenever you see a
disregarded character, behave as if it were not there at all!
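
To make the intended behaviour concrete, here is a rough sketch in Python that emulates it outside the regex engine. It is only an approximation under stated assumptions: the helper names strip_disregarded and search_disregarding are hypothetical, the default "disregard" set is taken to be every character with Unicode general category Mn (nonspacing mark), and the match is run against a stripped copy of the subject, so any offsets refer to that copy rather than to the original text.

```python
import re
import unicodedata


def strip_disregarded(text, disregard=None):
    """Return text with the disregarded characters removed.

    If no explicit set is given, assume every nonspacing mark
    (general category Mn) is to be disregarded.
    """
    if disregard is None:
        return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return "".join(ch for ch in text if ch not in disregard)


def search_disregarding(pattern, subject, disregard=None):
    """Hypothetical stand-in for the proposed /D modifier.

    Matches the pattern against a copy of the subject from which the
    disregarded characters have been removed.
    """
    return re.search(pattern, strip_disregarded(subject, disregard))


# alef + qamats (a vowel point) + bet
pointed = "\u05d0\u05b8\u05d1"

# A plain search for alef immediately followed by bet fails on pointed text...
plain = re.search("\u05d0\u05d1", pointed)

# ...but succeeds once the vowel point is disregarded.
relaxed = search_disregarding("\u05d0\u05d1", pointed)
```

The obvious drawback of doing it this way, and the reason I would like it inside the engine, is that the caller has to copy the subject and loses the original match positions.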
Another possibility would be to pass that list to the engine in advance via
some general parameter. What I would not want is to have it as a compile-time
option when the C code of the engine is built.

How hard would it be to implement something like this?

Ze'ev Atlas
From: "[email protected]" <[email protected]>
I am not expert on this kind of thing, but doesn't \X do some of what
you want? It will, for example, match the pair 05d0,05b8. It matches
what Unicode calls an "Extended Grapheme Cluster". See the description
in the pcre[2]pattern page for details.
If you wanted to match 05d0 plus any following mark characters you could
write this: (?=\x{5d0})\X which is clumsy, but I don't know of
anything neater.
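
For readers without PCRE to hand, the idea of "05d0 plus any following mark characters" can be sketched in Python as below. This is not \X: Python's standard re module has no \X, and a real Extended Grapheme Cluster follows more rules than this. The function name match_base_plus_marks is hypothetical, and "mark" is assumed to mean any character whose general category starts with M.

```python
import unicodedata


def match_base_plus_marks(subject, base="\u05d0", start=0):
    """Return the base letter plus any mark characters that follow it.

    Returns None if the subject does not have the base letter at `start`.
    """
    if not subject.startswith(base, start):
        return None
    end = start + len(base)
    # Consume following characters while they are marks (Mn, Mc or Me).
    while end < len(subject) and unicodedata.category(subject[end]).startswith("M"):
        end += 1
    return subject[start:end]


# alef + qamats + bet: the match covers alef and its vowel point only.
cluster = match_base_plus_marks("\u05d0\u05b8\u05d1")
```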
Philip
--
Philip Hazel
--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev