Re: ICUFoldingFilter

Robert Muir Mon, 04 Jun 2018 18:12:52 -0700

There may be traps, e.g. if you build such a filter from a UnicodeSet,
I think you really need to call .freeze() on it before passing it to
the filter. I have not examined the sources in a while, but I think this
is similar to "compiling a regexp": you then get good performance when
it's later called millions of times.


If you use the factories, it will do this for you. But if you use the
API directly it is currently a bit of a performance trap...
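
To make the idea concrete, here is a minimal stdlib-only sketch of what
"filtered" folding means. It is a toy model, not the Lucene or ICU code:
FilteredFoldSketch, TOY_FOLD and the BitSet are hypothetical stand-ins
(TOY_FOLD merely lowercases and drops "^"). The set is built once in the
constructor and only read afterwards, which is roughly the role a frozen
UnicodeSet would play; characters outside it are copied through untouched.

```java
import java.util.BitSet;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Toy model of a filtered normalizer; NOT the Lucene or ICU implementation. */
final class FilteredFoldSketch {

    // Hypothetical stand-in for the real folding: lowercases and drops '^'.
    static final UnaryOperator<String> TOY_FOLD =
        s -> s.toLowerCase(Locale.ROOT).replace("^", "");

    // Code points the fold is allowed to touch. Built once, then read-only.
    private final BitSet foldable;
    private final UnaryOperator<String> fold;

    FilteredFoldSketch(UnaryOperator<String> fold, String excluded) {
        this.fold = fold;
        BitSet set = new BitSet(Character.MAX_CODE_POINT + 1);
        set.set(0, Character.MAX_CODE_POINT + 1);          // start with everything
        excluded.codePoints().forEach(cp -> set.clear(cp)); // carve out exclusions
        this.foldable = set;
    }

    /** Folds runs of characters inside the set; copies all other runs as-is. */
    String apply(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            boolean inSet = foldable.get(input.codePointAt(i));
            int start = i;
            // Extend the run while membership in the set stays the same.
            while (i < input.length()
                    && foldable.get(input.codePointAt(i)) == inSet) {
                i += Character.charCount(input.codePointAt(i));
            }
            String run = input.substring(start, i);
            out.append(inSet ? fold.apply(run) : run);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        FilteredFoldSketch keepCaret = new FilteredFoldSketch(TOY_FOLD, "^");
        System.out.println(TOY_FOLD.apply("aaa^bbb"));  // prints aaabbb
        System.out.println(keepCaret.apply("aaa^bbb")); // prints aaa^bbb
    }
}
```

With the real classes, the equivalent would presumably be a frozen
UnicodeSet wrapped around the folding normalizer, which is what the
factories are said to do; check the javadocs for the exact constructors.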

On Mon, Jun 4, 2018 at 2:49 PM, Michael Sokolov <[email protected]> wrote:
> Ah thanks! That's very good to know. As it is I realized we already have an
> earlier component where we can handle this (we have a custom ICUTokenizer
> rbbi and can just split on "^"). So much flexibility!
>
> -Mike
>
> On Mon, Jun 4, 2018 at 10:53 AM, Robert Muir <[email protected]> wrote:
>
>> Actually, you can now choose to ignore certain characters by using
>> the Unicode filtering mechanism.
>>
>> This was added in https://issues.apache.org/jira/browse/LUCENE-8129
>>
>> So apply a filter such as [^\^] and the folding filter will ignore ^.
>>
>> On Mon, Jun 4, 2018 at 10:41 AM, Robert Muir <[email protected]> wrote:
>> > This cannot be "tweaked" at runtime; it is implemented as a custom
>> > normalization.
>> >
>> > You can modify the sources / build your own ruleset or use a different
>> > tokenfilter to normalize characters.
>> >
>> > On Mon, Jun 4, 2018 at 9:07 AM, Michael Sokolov <[email protected]> wrote:
>> >> Hi, I'm using ICUFoldingFilter and for the most part it does exactly
>> >> what I want. However there are some behaviors I'd like to tweak. For
>> >> example it maps "aaa^bbb" to "aaabbb". I am trying to understand why
>> >> it does that, and whether there is any way to prevent it.
>> >>
>> >> I spent a little time with
>> >> http://www.unicode.org/reports/tr30/tr30-4.html#UnicodeData which I
>> >> guess is the basis for what this filter does (it's referenced in the
>> >> javadocs), but that didn't answer my questions. As an aside, it seems
>> >> this tech report was withdrawn by the Unicode Consortium? Not sure
>> >> what that means, if anything, but it seems ominous.
>> >>
>> >> Anyway, I would appreciate pointers to more info, and specifically,
>> >> whether there are any alternatives to the utr30.nrm data file, or any
>> >> possibility to select among the many transformations this filter
>> >> applies.
>> >>
>> >> Thanks!
>> >>
>> >> Mike S
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>

