Re: [pcre-dev] Ignoring a whole set of unicode characters

Ze'ev Atlas Thu, 26 Mar 2015 12:01:06 -0700

Hi Philip,

I consider that to be a major issue for non-European languages, at
least those that distinguish between the basic consonant letters and additional
characters that may add information but are not necessary. (I gave you two
scripts in which such characters would be implied whether they are there or
not.)

I know that you abide by the Perl 'standard', so I guess my question is:
where do I go to propose such a change? Let's say something like the following
(similar to /i for ignore case in Perl), using D for 'Disregard':

if ($x =~ /some elaborate pattern with Unicode characters/D{UNICODE CLASS NAME})  # or D{list of code points}

which would mean: match the pattern /some elaborate pattern with Unicode
characters/, but whenever you see a disregarded character, behave as if it were
not there at all!

Another possibility would be to give that list to the engine in advance via
some general parameter. What I would not want is to have it as a compile-time
option when the engine's C code is built.

How hard would it be to implement something like this?
 Ze'ev Atlas
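
To make the proposed 'Disregard' semantics concrete, here is a minimal sketch in Python that emulates them outside the engine: it filters the disregarded characters out of the subject, matches against what is left, and maps the offsets back to the original string. The function names search_disregarding and is_mark are hypothetical, and treating every Unicode mark (general category M*) as disregarded is only an assumed default; this is not PCRE or Perl behaviour.

```python
import re
import unicodedata


def is_mark(ch):
    """Assumed default 'disregard' test: any Unicode mark (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def search_disregarding(pattern, subject, disregard=is_mark):
    """Search `subject` as if every disregarded character were absent.

    Returns (start, end) offsets into the ORIGINAL subject, or None.
    """
    kept = []    # characters that survive the filter
    index = []   # index[i] is the position in `subject` of kept[i]
    for pos, ch in enumerate(subject):
        if not disregard(ch):
            kept.append(ch)
            index.append(pos)

    m = re.search(pattern, "".join(kept))
    if m is None:
        return None

    start = index[m.start()] if m.start() < len(index) else len(subject)
    if m.end() == m.start():          # empty match: nothing to extend
        return (start, start)

    end = index[m.end() - 1] + 1
    # Absorb disregarded characters that trail the last matched character.
    while end < len(subject) and disregard(subject[end]):
        end += 1
    return (start, end)


# Hebrew "shalom" with and without vowel points, written as escapes.
POINTED = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
PLAIN = "\u05e9\u05dc\u05d5\u05dd"
```

With these definitions, re.search(PLAIN, POINTED) finds nothing, while search_disregarding(PLAIN, POINTED) reports a match covering the whole pointed word. The obvious cost of this approach is the extra pass over the subject and the offset table, which is part of why support inside the engine would be attractive.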



      From: "[email protected]" <[email protected]>


I am not expert on this kind of thing, but doesn't \X do some of what 
you want? It will, for example, match the pair 05d0,05b8. It matches 
what Unicode calls an "Extended Grapheme Cluster". See the description 
in the pcre[2]pattern page for details.

If you wanted to match 05d0 plus any following mark characters you could 
write this:  (?=\x{5d0})\X  which is clumsy, but I don't know of 
anything neater.
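
As a rough illustration of what the (?=\x{5d0})\X pattern above is meant to pick out, the following Python sketch collects each U+05D0 together with any mark characters that follow it. It uses only the standard library, whose re module has no \X, so it approximates an extended grapheme cluster as "base character plus following characters of general category M*"; real \X follows the fuller Unicode segmentation rules. The function name base_plus_marks is hypothetical.

```python
import unicodedata


def base_plus_marks(subject, base="\u05d0"):
    """Return every run of `base` followed by zero or more mark characters."""
    runs = []
    i = 0
    while i < len(subject):
        if subject[i] == base:
            j = i + 1
            # Extend the run over any combining marks (Mn, Mc, Me).
            while j < len(subject) and unicodedata.category(subject[j]).startswith("M"):
                j += 1
            runs.append(subject[i:j])
            i = j
        else:
            i += 1
    return runs
```

For the pair mentioned above, base_plus_marks("\u05d0\u05b8\u05d1") returns the single run "\u05d0\u05b8" and leaves the following U+05D1 alone. In PCRE itself, a property-based pattern such as \x{5d0}\p{M}* should express much the same idea when Unicode property support is available.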

Philip

-- 
Philip Hazel

  
-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
