Yes, that sounds good -- this ConditionalTokenFilter is going to be very helpful. We have overridden the ICUTokenizer's RBBI rules, but I'll poke around and see about incorporating the emoji rules from there.

Thanks, Robert
On Tue, Jul 3, 2018 at 9:28 AM Robert Muir <rcm...@gmail.com> wrote:
>
> > Any thoughts?
>
> best idea I have would be to tokenize with ICUTokenizer, which will
> tag emoji sequences as "<EMOJI>" token type, then use
> ConditionalTokenFilter to send all tokens EXCEPT those with token type
> of "<EMOJI>" to your WordDelimiterFilter. This way
> WordDelimiterFilter never sees the emoji at all and can't screw them
> up.
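
The routing described above can be sketched roughly as follows. This is a plain-Java model of the logic only, not Lucene code: `Token`, `splitOnDelimiters`, and `filterExceptEmoji` are hypothetical names made up for illustration, and the splitter is a crude stand-in for WordDelimiterFilter. In Lucene itself the same decision would presumably live in a ConditionalTokenFilter subclass whose `shouldFilter()` checks the token's type attribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class ConditionalRoutingSketch {

    /** Hypothetical stand-in for an analyzed token: its text plus its token type. */
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /** Token type that ICUTokenizer assigns to emoji sequences, per the thread. */
    static final String EMOJI_TYPE = "<EMOJI>";

    /**
     * Crude stand-in for WordDelimiterFilter (an assumption, not its real behavior):
     * splits on every run of characters that are neither letters nor digits.
     */
    static List<Token> splitOnDelimiters(Token token) {
        List<Token> parts = new ArrayList<>();
        for (String part : token.text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                parts.add(new Token(part, token.type));
            }
        }
        return parts;
    }

    /**
     * Sends every token EXCEPT those typed "<EMOJI>" through the wrapped filter;
     * emoji tokens bypass it and are emitted unchanged.
     */
    static List<Token> filterExceptEmoji(List<Token> input,
                                         Function<Token, List<Token>> wrapped) {
        List<Token> output = new ArrayList<>();
        for (Token token : input) {
            if (EMOJI_TYPE.equals(token.type)) {
                output.add(token);                    // the wrapped filter never sees it
            } else {
                output.addAll(wrapped.apply(token));  // normal path
            }
        }
        return output;
    }

    /** Convenience helper: extracts just the token text. */
    static List<String> texts(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token token : tokens) {
            out.add(token.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token("wi-fi", "<ALPHANUM>"),
                // Thumbs-up followed by a skin-tone modifier, as one emoji token.
                new Token("\uD83D\uDC4D\uD83C\uDFFD", EMOJI_TYPE));
        System.out.println(texts(
                filterExceptEmoji(tokens, ConditionalRoutingSketch::splitOnDelimiters)));
    }
}
```

In this model, passing the emoji token straight to `splitOnDelimiters` would drop it entirely, because none of its code points are letters or digits; routing on the token type is what keeps it intact.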