i18nregexp replaced with ICU regexp => heads up

Herbert Duerr Fri, 30 Sep 2011 06:08:32 -0700

Hi,

To remove "category X excluded licenses" from Apache OpenOffice, I replaced the formerly used LGPL-licensed module i18nregexp with the regular expression engine of the ICU module, which is already widely used in OpenOffice.

The replacement fixes a lot of problems. For example, in the text "abcabc", trying to "find all backwards" for "b" used to find only the last "b"; now it actually finds all of them. It also introduces some changes: i18nregexp had two modes, "classic" and "extended" regexp, whereas the ICU-based engine treats all patterns as extended regexps.
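To make the expected "find all backwards" result concrete, here is a minimal sketch. It uses Python's built-in re module purely as a stand-in, not the ICU engine or any OpenOffice code, and the helper name find_all_backwards is made up for illustration.

```python
import re

def find_all_backwards(pattern, text):
    """Return the start offsets of all matches, last match first.

    Hypothetical helper: it only illustrates the expected result of a
    backwards "find all", it does not reproduce the OpenOffice search code.
    """
    starts = [m.start() for m in re.finditer(pattern, text)]
    return starts[::-1]

# Both occurrences of "b" are reported, starting from the end of the text.
print(find_all_backwards("b", "abcabc"))  # [4, 1]
```

With the old engine only the match at offset 4 would have been reported.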

i18nregexp used an approach where it transliterated and compared each codepoint pair of the pattern and the text string. The new engine does the transliteration only once per pattern and once per text string. This is much faster, but it only works because the transliteration was tweaked to preserve the special regexp control characters.
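The following is a rough sketch of the "transliterate once, then match" idea, assuming a simple width-folding transliteration. It is written against Python's re module rather than ICU, and all names in it (fold_width, transliterate_pattern, transliterate_text, search_once) are hypothetical. The way metacharacters are protected here, by escaping a fullwidth character whose folded form would be a regexp control character, is only one possible approach and not necessarily what the actual tweak does.

```python
import re

# ASCII characters with a special meaning in a regular expression.
REGEX_META = set(".^$*+?{}[]\\|()")

def fold_width(ch):
    """Map a fullwidth ASCII variant (U+FF01..U+FF5E) to its ASCII form."""
    cp = ord(ch)
    if 0xFF01 <= cp <= 0xFF5E:
        return chr(cp - 0xFEE0)
    return ch

def transliterate_text(text):
    """Fold the whole text once, before any matching happens."""
    return "".join(fold_width(ch) for ch in text)

def transliterate_pattern(pattern):
    """Fold the pattern once, leaving its control characters untouched.

    ASCII metacharacters pass through unchanged. A fullwidth character that
    would fold INTO a metacharacter is escaped, so it stays a literal.
    """
    out = []
    for ch in pattern:
        folded = fold_width(ch)
        if folded != ch and folded in REGEX_META:
            out.append("\\" + folded)
        else:
            out.append(folded)
    return "".join(out)

def search_once(pattern, text):
    """Transliterate pattern and text once each, then run a normal search."""
    compiled = re.compile(transliterate_pattern(pattern))
    folded = transliterate_text(text)
    # The folding is one-to-one, so offsets are valid for the original text.
    return [m.span() for m in compiled.finditer(folded)]

print(search_once("ＡＢ+", "xABBy ＡＢ"))  # [(1, 4), (6, 8)]
```

Because this particular folding maps one character to one character, the match offsets can be used directly on the original text. Transliterations that change the string length would need an offset mapping in addition.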

The reporters of any issues in the lists below are encouraged to check whether the problems they saw still occur with the new engine.

https://issues.apache.org/ooo/buglist.cgi?quicksearch=regexp
https://issues.apache.org/ooo/buglist.cgi?quicksearch=regular\ expression

Please make sure to have the "More Options -> Regular Expressions" checkbox activated for testing.

I'm afraid the regexp replacement resulted in changes mostly for Japanese users, because a lot of non-trivial transliterations are active there. For reference, these are the active rules: "ProlongedSoundMark", "IterationMark", "Ignore-Width", "BaFa", "SeZe", "HyuByu", "IandEfollowedByYa" and "KiKuFollowedBySa".


Herbert
