[pcre-dev] Ignoring a whole set of unicode characters

Ze'ev Atlas Wed, 25 Mar 2015 21:07:35 -0700

I would like to ask whether it is possible in PCRE, or whether it could be
implemented, to tell the regex engine to ignore a whole set of Unicode
characters.

Let me explain. In some languages, Arabic and Hebrew being the main examples,
there are Unicode characters that are considered to be vowels. They do not
exist on their own and do not take up a space in print, but are applied to a
base character. In most cases those vowels are dropped and not used at all,
since readers know from the context what they would be, but in other cases
they are written. As opposed to diacritics, they are NEVER considered a part
of the base character.

Example: code point 05d0 is HEBREW LETTER ALEF, but the two code points
05d0,05B8 next to each other are HEBREW LETTER ALEF with HEBREW POINT QAMATS.

Now, in many cases, all we need to search for are the base characters, without
having to take care of all possible such combinations. What I need is a way to
tell the regex engine something like: "when you see any of the characters
between code points 05a0 and 05c7 (which represent the classes 'Hebrew
Cantillation marks', 'Points and punctuation' and 'Puncta extraordinaria'),
ignore them as if they were not there at all". Similar functionality could
apply to Arabic code points 066a through 066d (Punctuation); there are more
groups like this.

Ze'ev Atlas
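
To make the requested behaviour concrete, here is a minimal sketch of how it
can be approximated today by rewriting the pattern. It uses Python's re module
rather than PCRE, and the helper name ignoring_marks is hypothetical. The
ignorable range is the one quoted in the message (05a0 through 05c7); it is an
assumption of this sketch, not a complete list of Hebrew marks.

```python
import re

# Assumption: the ignorable range quoted in the message, U+05A0..U+05C7.
IGNORABLE = "[\u05a0-\u05c7]*"


def ignoring_marks(base_letters: str) -> re.Pattern:
    """Compile a pattern that matches base_letters while allowing any
    number of ignorable marks after each base letter."""
    parts = [re.escape(ch) + IGNORABLE for ch in base_letters]
    return re.compile("".join(parts))


plain = "\u05d0\u05d1"          # ALEF, BET
pointed = "\u05d0\u05b8\u05d1"  # ALEF, QAMATS, BET

pattern = ignoring_marks(plain)
found_plain = pattern.search(plain) is not None
found_pointed = pattern.search(pointed) is not None
```

The same rewrite can be done by hand in a PCRE pattern in UTF mode, for
example \x{05d0}[\x{05a0}-\x{05c7}]*, but it has to be repeated after every
base letter, which is the burden the requested feature would remove.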


-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
