Hi,

To remove "category X excluded licenses" from Apache OpenOffice, I replaced the formerly used LGPL-licensed module i18nregexp with the regular expression engine of the ICU module, which is already widely used in OpenOffice.

The replacement fixes a lot of problems. For example, in the text "abcabc" a "find all backwards" for "b" used to find only the last "b"; now it actually finds all of them. It also introduces some changes: i18nregexp had two modes, "classic" and "extended" regexp, whereas the ICU-based engine treats all patterns as extended regexps.
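To make the expected "find all backwards" result concrete, here is a minimal sketch. It uses Python's re module as a stand-in, not ICU or the OpenOffice search code, and the helper name find_all_backwards is made up for illustration. It only shows which matches a correct backwards search should report.

```python
import re

def find_all_backwards(pattern, text):
    """Return the start offsets of all matches, last match first.

    Hypothetical helper: Python's re stands in for the ICU engine here.
    """
    matches = list(re.finditer(pattern, text))
    return [m.start() for m in reversed(matches)]

# Both occurrences of "b" in "abcabc" are reported, starting from the end.
offsets = find_all_backwards("b", "abcabc")
```

With the old engine only the match at the higher offset was reported; a correct result contains one entry per "b".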

i18nregexp transliterated and compared the pattern and the text string one codepoint pair at a time. The new engine transliterates each pattern and each text string only once. This is much faster, but it only works because the transliteration was tweaked to preserve the special regexp control characters.
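The idea of transliterating once while keeping the regexp control characters intact can be sketched as follows. This is a toy under loud assumptions: lowercasing stands in for the real transliteration rules, Python's re stands in for ICU, and all function names are hypothetical. Which characters the actual OpenOffice transliteration protects is not shown here.

```python
import re

def transliterate_text(text):
    # Toy transliteration (assumption): plain lowercasing.
    return text.lower()

def transliterate_pattern(pattern):
    """Transliterate a pattern once, leaving escape sequences untouched.

    Lowercasing blindly would turn \\S (non-whitespace) into \\s
    (whitespace), so a backslash and the character after it are
    copied verbatim.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # keep the escape as written
            i += 2
        else:
            out.append(pattern[i].lower())
            i += 1
    return "".join(out)

def search_once(pattern, text):
    # One transliteration pass per pattern and per text, then a normal match.
    rx = re.compile(transliterate_pattern(pattern))
    return rx.search(transliterate_text(text))

match = search_once(r"FOO\S+", "a Foobar b")
```

Note that the match is found in the transliterated text, so a real implementation would still have to report the result in terms of the original string.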

Reporters of any of the issues in the lists below are encouraged to re-check the problems they saw, now using the new engine:
https://issues.apache.org/ooo/buglist.cgi?quicksearch=regexp
https://issues.apache.org/ooo/buglist.cgi?quicksearch=regular\ expression
Please make sure to have the "More Options -> Regular Expressions" checkbox activated for testing.

I'm afraid the regexp replacement changes behavior mostly for Japanese users, because a lot of non-trivial transliterations are active for them. For reference, the active rules are: "ProlongedSoundMark", "IterationMark", "Ignore-Width", "BaFa", "SeZe", "HyuByu", "IandEfollowedByYa" and "KiKuFollowedBySa".

Herbert
