If you customized the rules, maybe have a look at https://issues.apache.org/jira/browse/LUCENE-8366
The rules got simpler and we also updated the customization example used for the factory's test.

On Tue, Jul 3, 2018 at 10:46 AM, Michael Sokolov <msoko...@gmail.com> wrote:
> Yes that sounds good -- this ConditionalTokenFilter is going to be very
> helpful. We have overridden the ICUTokenizer's rbbi rules, but I'll poke
> around and see about incorporating the emoji rules from there. Thanks
> Robert
>
> On Tue, Jul 3, 2018 at 9:28 AM Robert Muir <rcm...@gmail.com> wrote:
>
>> > Any thoughts?
>>
>> best idea I have would be to tokenize with ICUTokenizer, which will
>> tag emoji sequences as "<EMOJI>" token type, then use
>> ConditionalTokenFilter to send all tokens EXCEPT those with token type
>> of "<EMOJI>" to your WordDelimiterFilter. This way
>> WordDelimiterFilter never sees the emoji at all and can't screw them
>> up.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
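
For illustration, the routing idea from Robert's quoted message can be modelled in plain Java without any Lucene dependency. This is only a sketch under stated assumptions: the Token class, splitOnDelimiters and applyUnless are hypothetical names invented here, splitOnDelimiters is a crude stand-in for WordDelimiterFilter and not its real behaviour, and the tokens are built by hand instead of coming from ICUTokenizer. In an actual analysis chain, ConditionalTokenFilter would make the per-token decision based on the token type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

public class ConditionalRoutingSketch {

    // Token type the thread says ICUTokenizer assigns to emoji sequences.
    static final String EMOJI_TYPE = "<EMOJI>";

    /** Hypothetical token model: surface text plus a type tag. */
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }

        @Override
        public String toString() {
            return text + "/" + type;
        }
    }

    /**
     * Crude stand-in for a word-delimiter style filter (NOT the real
     * WordDelimiterFilter): splits on every run of characters that are
     * neither letters nor digits and drops empty parts.
     */
    static List<Token> splitOnDelimiters(Token t) {
        List<Token> out = new ArrayList<>();
        for (String part : t.text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                out.add(new Token(part, t.type));
            }
        }
        return out;
    }

    /**
     * Applies the filter to every token except those matched by skip,
     * which pass through unchanged. This mirrors the "send all tokens
     * EXCEPT emoji to the filter" idea from the thread.
     */
    static List<Token> applyUnless(List<Token> tokens,
                                   Predicate<Token> skip,
                                   Function<Token, List<Token>> filter) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            if (skip.test(t)) {
                out.add(t);                   // bypass: the filter never sees it
            } else {
                out.addAll(filter.apply(t));  // normal path through the filter
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-built tokens; a real chain would get these from the tokenizer.
        List<Token> tokens = Arrays.asList(
                new Token("foo_bar", "<ALPHANUM>"),
                new Token("\uD83D\uDC4D\uD83C\uDFFD", EMOJI_TYPE)); // thumbs up + skin tone

        List<Token> result = applyUnless(
                tokens,
                t -> EMOJI_TYPE.equals(t.type),
                ConditionalRoutingSketch::splitOnDelimiters);

        for (Token t : result) {
            System.out.println(t);
        }
    }
}
```

With these inputs the stand-in splits foo_bar into two parts while the emoji token passes through untouched. Applying the stand-in to every token would lose the emoji, since it contains no letters or digits.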