If you customized the rules, maybe have a look at
https://issues.apache.org/jira/browse/LUCENE-8366

The rules got simpler and we also updated the customization example
used for the factory's test.

On Tue, Jul 3, 2018 at 10:46 AM, Michael Sokolov <msoko...@gmail.com> wrote:
> Yes that sounds good -- this ConditionalTokenFilter is going to be very
> helpful. We have overridden the ICUTokenizer's rbbi rules, but I'll poke
> around and see about incorporating the emoji rules from there.  Thanks
> Robert
>
> On Tue, Jul 3, 2018 at 9:28 AM Robert Muir <rcm...@gmail.com> wrote:
>
>> > Any thoughts?
>>
>> best idea I have would be to tokenize with ICUTokenizer, which will
>> tag emoji sequences as "<EMOJI>" token type, then use
>> ConditionalTokenFilter to send all tokens EXCEPT those with token type
>> of  "<EMOJI>" to your WordDelimiterFilter. This way
>> WordDelimiterFilter never sees the emoji at all and can't screw them
>> up.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to