Yes, that sounds good -- this ConditionalTokenFilter is going to be very
helpful. We have overridden the ICUTokenizer's RBBI rules, but I'll poke
around and see about incorporating the emoji rules from the stock rules
into ours. Thanks, Robert.
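
For the archives, here is a rough, Lucene-free model of the routing as I
understand it. Token, splitOnDelimiters, route and demo are hypothetical
names of mine, not Lucene API, and splitOnDelimiters is only a crude
stand-in for WordDelimiterFilter. In Lucene itself the bypass would be done
by ConditionalTokenFilter looking at the token type, so treat this as a
sketch of the idea rather than the actual analysis chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class EmojiRoutingSketch {

    /** Hypothetical stand-in for a token: its text plus a token type. */
    public static final class Token {
        public final String text;
        public final String type;

        public Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /** Token type the suggestion relies on for emoji sequences. */
    public static final String EMOJI_TYPE = "<EMOJI>";

    /**
     * Crude stand-in for WordDelimiterFilter (an assumption, not its real
     * behaviour): split on anything that is not a letter or a digit.
     * An emoji-only token comes out of this as nothing at all.
     */
    public static List<Token> splitOnDelimiters(Token t) {
        List<Token> out = new ArrayList<>();
        for (String part : t.text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                out.add(new Token(part, t.type));
            }
        }
        return out;
    }

    /**
     * Send every token through the filter EXCEPT those typed as emoji,
     * which pass through untouched.
     */
    public static List<Token> route(List<Token> tokens,
                                    Function<Token, List<Token>> filter) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            if (EMOJI_TYPE.equals(t.type)) {
                out.add(t);                    // the filter never sees it
            } else {
                out.addAll(filter.apply(t));   // normal path
            }
        }
        return out;
    }

    /** Runs the routing on "wi-fi" followed by a thumbs-up emoji. */
    public static List<String> demo() {
        List<Token> in = Arrays.asList(
                new Token("wi-fi", "<ALPHANUM>"),
                new Token("\uD83D\uDC4D", EMOJI_TYPE));
        List<String> texts = new ArrayList<>();
        for (Token t : route(in, EmojiRoutingSketch::splitOnDelimiters)) {
            texts.add(t.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the model is just that the emoji token is decided on by its
type before the delimiter logic runs, so whatever that logic would do to it
no longer matters.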

On Tue, Jul 3, 2018 at 9:28 AM Robert Muir <rcm...@gmail.com> wrote:

> > Any thoughts?
>
> The best idea I have would be to tokenize with ICUTokenizer, which will
> tag emoji sequences with the "<EMOJI>" token type, then use
> ConditionalTokenFilter to send all tokens EXCEPT those with a token type
> of "<EMOJI>" to your WordDelimiterFilter. This way
> WordDelimiterFilter never sees the emoji at all and can't screw them
> up.
>
