Thanks for the pointer.

On Tue, Jul 3, 2018 at 9:04 AM julien Blaize <julien.bla...@gmail.com> wrote:
> Hello Michael,
>
> I previously worked on emoji detection with Lucene.
>
> I had to extend the Tokenizer class (not TokenFilter, as
> WordDelimiterFilter does) to preserve the delimiter attribute.
> I also had to keep track of consecutive delimiters in the character stream,
> because Lucene's default implementation only keeps the last one.
>
> It may put you on the right track to start by looking at the
> Tokenizer instead of the TokenFilter.
>
> By the way, I used the emoji list from this project to detect sequences of
> characters:
>
> https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-fr.txt
>
> I scan the character stream, and while the current sequence is still a
> possible emoji I keep tracking it; once I have a full emoji I put it in the
> CharTermAttribute so it is treated as a word and not a delimiter.
>
> Regards
> --
> Julien Blaize
>
>
> On Tue, Jul 3, 2018 at 14:00, Michael Sokolov <msoko...@gmail.com> wrote:
>
> > WDGF (WordDelimiterGraphFilter) and WordDelimiterFilter treat emoji as
> > "SUBWORD_DELIM" characters, like punctuation, and thus remove them. We
> > would like to be able to search for emoji and still use this filter for
> > handling dashes, dots and other intra-word punctuation.
> >
> > These filters identify non-word and non-digit characters by two
> > mechanisms: direct lookup in a character table, and fallback to Unicode
> > class. The character table can't easily be used to handle emoji, since it
> > would need to be populated with the entire Unicode character set in order
> > to reach emoji-land. On the other hand, if we change the handling of emoji
> > by class and, say, treat them as word characters, this will also end up
> > pulling in all the other OTHER_SYMBOL characters as well. Maybe that's OK,
> > but I think some of these other symbols are more like punctuation (this
> > class is a grab bag of all kinds of beautiful dingbats like trademark,
> > degree symbols, etc.:
> > https://www.compart.com/en/unicode/category/So). On the other other hand,
> > how do we even identify emoji? I don't think the Java Character API is
> > adequate to the task. Perhaps we must incorporate a table.
> >
> > Suppose we come up with a good way to classify emoji; then how should they
> > be treated in this class? Sometimes they may be embedded in tokens with
> > other characters (I see people using emoji and other symbols as part of
> > their names), and sometimes they stand alone (with whitespace separation).
> > I think one way forward here would be to treat these as a special class
> > akin to words and numbers, and provide similar options (SPLIT_ON_EMOJI,
> > CATENATE_EMOJI) as we have for those classes.
> >
> > Or maybe, as a convenience, we provide a way to get a table that encodes
> > the default classifications of all characters up to some given limit, and
> > then let the caller modify it? That would at least provide an easy way to
> > treat emoji as letters.
> >
> > Any thoughts?
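
To make the classification question in the quoted message concrete, here is a rough sketch of the "table lookup, then fall back to Unicode class" idea with an extra emoji class bolted on. It is not the WordDelimiterFilter/WDGF code: the class name CharClassSketch, the type codes, the override map and the looksLikeEmoji range are all assumptions made up for illustration, and the single code point range is a crude stand-in for a real emoji table (it is over-inclusive and ignores multi-code-point sequences). Only the java.lang.Character calls are real JDK API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not the actual WordDelimiterFilter / WDGF code.
final class CharClassSketch {
    // Made-up type codes for illustration; EMOJI is the proposed extra class.
    static final int ALPHA = 1;
    static final int DIGIT = 2;
    static final int SUBWORD_DELIM = 3;
    static final int EMOJI = 4;

    // Sparse per-code-point overrides, standing in for the character table.
    // A map avoids having to populate every slot up to the emoji blocks.
    private final Map<Integer, Integer> overrides = new HashMap<>();

    void override(int codePoint, int type) {
        overrides.put(codePoint, type);
    }

    // Assumption: one rough, over-inclusive range stands in for a real emoji table.
    static boolean looksLikeEmoji(int codePoint) {
        return codePoint >= 0x1F300 && codePoint <= 0x1FAFF;
    }

    int classify(int codePoint) {
        Integer overridden = overrides.get(codePoint);
        if (overridden != null) {
            return overridden;                 // 1. explicit table entry wins
        }
        if (looksLikeEmoji(codePoint)) {
            return EMOJI;                      // 2. assumed emoji check
        }
        if (Character.isLetter(codePoint)) {
            return ALPHA;                      // 3. fall back to Unicode class
        }
        if (Character.isDigit(codePoint)) {
            return DIGIT;
        }
        return SUBWORD_DELIM;                  // everything else is a delimiter
    }

    public static void main(String[] args) {
        CharClassSketch classes = new CharClassSketch();
        int grinningFace = 0x1F600;            // an emoji
        int tradeMark = 0x2122;                // the trademark sign

        // Both code points share the same general category, which is why
        // switching on OTHER_SYMBOL alone cannot separate them.
        System.out.println(Character.getType(grinningFace) == Character.OTHER_SYMBOL);
        System.out.println(Character.getType(tradeMark) == Character.OTHER_SYMBOL);

        System.out.println(classes.classify(grinningFace));
        System.out.println(classes.classify(tradeMark));

        // A caller who wants the trademark sign kept inside words can override it.
        classes.override(tradeMark, ALPHA);
        System.out.println(classes.classify(tradeMark));
    }
}
```

The sparse override map is one possible answer to the "let the caller modify the table" idea: defaults come from the Unicode class, and only the exceptions are stored.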