Character filters are executed before the tokenizer, so only something in that family of filters would work if you plan to continue using the keyword tokenizer.
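To illustrate the ordering point, here is a minimal sketch (not from the thread) of one member of that family, the pattern_replace char filter, combined with the keyword tokenizer. The names strip_stopwords and place_keyword are hypothetical. The accented spelling "jurisdicción" is added on the assumption that a char filter sees the raw text, before the lowercase and asciifolding token filters have run. The Python regex only approximates what Java's regex engine does inside Elasticsearch, so the real result should still be checked with the _analyze API.

```python
import json
import re
import unicodedata

# Stop words from the thread, plus the accented spelling (an assumption):
# a char filter runs on the raw text, before lowercase/asciifolding.
STOP_WORDS = ["la", "el", "de", "del", "los", "las", "jurisdiccion", "jurisdicción"]


def build_stopword_pattern(words):
    """Regex matching any stop word as a whole word, plus trailing whitespace."""
    alternatives = "|".join(re.escape(w) for w in words)
    # (?iu): case-insensitive, Unicode-aware; the inline flags are accepted
    # by both Java and Python regex syntax.
    return r"(?iu)\b(?:" + alternatives + r")\b\s*"


def build_index_settings(words):
    """Index settings using the hypothetical names strip_stopwords / place_keyword."""
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    "strip_stopwords": {
                        "type": "pattern_replace",
                        "pattern": build_stopword_pattern(words),
                        "replacement": "",
                    }
                },
                "analyzer": {
                    "place_keyword": {
                        "type": "custom",
                        "char_filter": ["strip_stopwords"],
                        "tokenizer": "keyword",
                        "filter": ["lowercase", "asciifolding", "trim"],
                    }
                },
            }
        }
    }


def simulate_analysis(text, words):
    """Local approximation of the chain: char filter, lowercase, ASCII folding, trim."""
    stripped = re.sub(build_stopword_pattern(words), "", text)
    lowered = stripped.lower()
    # Decompose accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.strip()


if __name__ == "__main__":
    sample = ("El corregimiento de Mulaló, jurisdicción del municipio "
              "de Yumbo (Valle del Cauca)")
    # Prints: corregimiento mulalo, municipio yumbo (valle cauca)
    print(simulate_analysis(sample, STOP_WORDS))
    print(json.dumps(build_index_settings(STOP_WORDS), indent=2, ensure_ascii=False))
```

The dictionary returned by build_index_settings is meant to be sent as the body of an index-creation request. Because the whole stop word list collapses into one alternation, the list can still be maintained as plain words rather than as hand-written regular expressions.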
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html

The mapping char filter might be a better match if your list is not in regex form. I use the mapping char filter to remove copyright, trademark, and a whole list of other characters from my content.

Cheers,

Ivan

On Thu, Aug 28, 2014 at 2:33 PM, Germán Carrillo <[email protected]> wrote:

> Ivan, yes, I'm aware I would obtain another text, that's fine. Even more,
> my docs have a "display" field to be returned to users after a search. For
> the example given above, the display value would be something like:
> "Mulaló, Yumbo, Valle del Cauca."
>
> Itamar, I've actually considered several options. I think a synonym file
> would be too big. I gave you 11 equivalent terms (you might've noticed I
> could have continued to give you around 30 equivalent ways), but I didn't
> mention place names (alone) have their corresponding synonyms, alternate
> names, abbreviations, and vernacular names. There could be 10k different
> places (docs) in the index. :D Also, taking into account every single case
> into the synonym file seems to be sub-optimal. Really, I intend to
> normalize a large number of ways of expressing place hierarchy into a few
> ways. Otherwise I'd have to build very large lists for each place I add to
> the index, and nothing prevents I'm missing a weird case. BTW, handling
> hierarchy is a must, otherwise result disambiguation would be a nightmare
> for users.
>
> Thanks for all the discussion, it's certainly valuable to read an expert's
> opinion.
>
> Back to my very first question, is the pattern replace token filter the
> only way to remove stop words from tokens obtained from a keyword tokenizer?
> Are those regular expressions not very performant?
>
>
> 2014-08-28 15:49 GMT-05:00 Ivan Brusic <[email protected]>:
>
>> You mentioned in your original post "I'd like to obtain the original
>> text without stop words"
>>
>> The stopword-less phrase will indeed be present in the index after the
>> analysis phase; however, when you ask for this content back as a result of
>> a query, the original text will be returned. What is indexed is not
>> necessarily what is stored/returned.
>>
>> Cheers,
>>
>> Ivan
>>
>>
>> On Thu, Aug 28, 2014 at 12:30 PM, Germán Carrillo <
>> [email protected]> wrote:
>>
>>> Thanks Ivan,
>>>
>>> do you mean what I obtain from a request such as
>>>
>>> curl -XGET
>>> 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase,my_ascii_folding,my_stopwords'
>>> -d 'El corregimiento de Mulaló, jurisdicción del municipio de Yumbo
>>> (Valle del Cauca)'
>>>
>>> is not what will be present in the index after the analysis process? If
>>> so, how could I check whether the stop words filter is being (will be)
>>> applied to a sample phrase?
>>>
>>>
>>> 2014-08-28 14:03 GMT-05:00 Ivan Brusic <[email protected]>:
>>>
>>>> Also note that the content returned will still contain the stop
>>>> words. Only the inverted index will contain the stopword-less content.
>>>>
>>>> --
>>>> Ivan
>>>>
>>>>
>>>> On Thu, Aug 28, 2014 at 11:55 AM, Itamar Syn-Hershko <
>>>> [email protected]> wrote:
>>>>
>>>>> What would be the usecase for such a process (removing stop words
>>>>> without tokenization)?
>>>>>
>>>>> This may be a good read btw:
>>>>> http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/
>>>>>
>>>>> --
>>>>>
>>>>> Itamar Syn-Hershko
>>>>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>>>>> Freelance Developer & Consultant
>>>>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>>>>
>>>>>
>>>>> On Thu, Aug 28, 2014 at 9:48 PM, German Carrillo <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>>
>>>>>> I'm looking for a way to remove stop words from tokens returned by a
>>>>>> keyword tokenizer, i.e., I'd like to obtain the original text without
>>>>>> stop words after the analysis process.
>>>>>>
>>>>>> Sample data looks like:
>>>>>>   "El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)"
>>>>>> After the lowercase token filter:
>>>>>>   "el corregimiento de mulaló, jurisdicción del municipio de yumbo (valle del cauca)"
>>>>>> After the ascii folding token filter:
>>>>>>   "el corregimiento de mulalo, jurisdiccion del municipio de yumbo (valle del cauca)"
>>>>>> After removing stop words:
>>>>>>   "corregimiento mulalo, municipio yumbo (valle cauca)"
>>>>>>
>>>>>> The stop words (currently) are: ["la", "el", "de", "del", "los",
>>>>>> "las", "jurisdiccion"]
>>>>>>
>>>>>> Is the pattern replace token filter the only (or best) way to go for
>>>>>> such a task?
>>>>>>
>>>>>> I'd really like to avoid writing custom regular expressions rather
>>>>>> than specifying a stop words list, which I know would work perfectly fine
>>>>>> for other tokenizers.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Germán
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "elasticsearch" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQB7V%2B20202bENTvqbJ86%2BDaNSMLDCXpq%2B5nY6F1qa3DWA%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
