Thanks Ivan! I'll test which approach fits my needs better.
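
For anyone following along, here is a rough sketch of the normalization steps from my original question (quoted at the bottom of this thread), written in plain Python rather than as Elasticsearch configuration. It treats the whole text as a single string, as the keyword tokenizer does, then lowercases it, folds accents, and strips the stop words with one regular expression. The names `ascii_fold`, `normalize` and `STOP_WORDS_RE` are made up for this illustration, and the NFKD-based folding only approximates what an ASCII folding filter does.

```python
import re
import unicodedata

# Stop words from the original question (already lowercased and ASCII-folded).
STOP_WORDS = ["la", "el", "de", "del", "los", "las", "jurisdiccion"]

# Single alternation wrapped in \b so that only whole words are matched
# (e.g. "la" inside "mulalo" is left alone).
STOP_WORDS_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in STOP_WORDS) + r")\b"
)


def ascii_fold(text: str) -> str:
    """Rough stand-in for ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text: str) -> str:
    """Lowercase, fold accents, remove stop words, collapse leftover whitespace."""
    folded = ascii_fold(text.lower())
    without_stop_words = STOP_WORDS_RE.sub(" ", folded)
    return re.sub(r"\s+", " ", without_stop_words).strip()


if __name__ == "__main__":
    sample = (
        "El corregimiento de Mulaló, jurisdicción del municipio de Yumbo "
        "(Valle del Cauca)"
    )
    print(normalize(sample))
    # prints: corregimiento mulalo, municipio yumbo (valle cauca)
```

One caveat if the same regular expression is moved into a pattern replace char filter: char filters run before the tokenizer, and therefore before the lowercase and ASCII folding token filters, so the pattern itself would have to cope with upper case and accented spellings.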

2014-08-28 17:12 GMT-05:00 Ivan Brusic <[email protected]>:

> Character filters are executed before the tokenizer, so only something in
> that family of filters would work if you plan to continue using the keyword
> tokenizer.
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
>
> The mapping char filter might be a better match if your list is not in
> regex form. I use the mapping char filter to remove copyright, trademark
> and a whole list of other characters from my content.
>
> Cheers,
>
> Ivan
>
>
> On Thu, Aug 28, 2014 at 2:33 PM, Germán Carrillo <
> [email protected]> wrote:
>
>> Ivan, yes, I'm aware I would obtain a different text, and that's fine.
>> What's more, my docs have a "display" field to be returned to users after
>> a search. For the example given above, the display value would be
>> something like: "Mulaló, Yumbo, Valle del Cauca."
>>
>> Itamar, I've actually considered several options. I think a synonym file
>> would be too big. I gave you 11 equivalent terms (you might've noticed I
>> could have continued to give you around 30 equivalent ways), but I didn't
>> mention that place names (alone) have their corresponding synonyms,
>> alternate names, abbreviations, and vernacular names. There could be 10k
>> different places (docs) in the index. :D Also, covering every single case
>> in the synonym file seems sub-optimal. Really, I intend to normalize a
>> large number of ways of expressing place hierarchy into a few ways.
>> Otherwise I'd have to build very large lists for each place I add to the
>> index, and nothing would prevent me from missing a weird case. BTW,
>> handling hierarchy is a must, otherwise result disambiguation would be a
>> nightmare for users.
>>
>> Thanks for all the discussion, it's certainly valuable to read an
>> expert's opinion.
>>
>> Back to my very first question, is the pattern replace token filter the
>> only way to remove stop words from tokens obtained from a keyword tokenizer?
>> Are those regular expressions not very performant?
>>
>>
>> 2014-08-28 15:49 GMT-05:00 Ivan Brusic <[email protected]>:
>>
>>> You mentioned in your original post "I'd like to obtain the original
>>> text without stop words"
>>>
>>> The stopword-less phrase will indeed be present in the index after the
>>> analysis phase; however, when you ask for this content back as a result of
>>> a query, the original text will be returned. What is indexed is not
>>> necessarily what is stored/returned.
>>>
>>> Cheers,
>>>
>>> Ivan
>>>
>>>
>>> On Thu, Aug 28, 2014 at 12:30 PM, Germán Carrillo <
>>> [email protected]> wrote:
>>>
>>>> Thanks Ivan,
>>>>
>>>> do you mean what I obtain from a request such as
>>>>
>>>> curl -XGET
>>>> 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase,my_ascii_folding,my_stopwords'
>>>> -d 'El corregimiento de Mulaló, jurisdicción del municipio de Yumbo
>>>> (Valle del Cauca)'
>>>>
>>>> is not what will be present in the index after the analysis process? If
>>>> so, how could I check whether the stop words filter is being (will be)
>>>> applied to a sample phrase?
>>>>
>>>>
>>>> 2014-08-28 14:03 GMT-05:00 Ivan Brusic <[email protected]>:
>>>>
>>>>> Also note that the content returned will still contain the stop
>>>>> words. Only the inverted index will contain the stopword-less content.
>>>>>
>>>>> --
>>>>> Ivan
>>>>>
>>>>>
>>>>> On Thu, Aug 28, 2014 at 11:55 AM, Itamar Syn-Hershko <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> What would be the use case for such a process (removing stop words
>>>>>> without tokenization)?
>>>>>>
>>>>>> This may be a good read btw:
>>>>>> http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Itamar Syn-Hershko
>>>>>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>>>>>> Freelance Developer & Consultant
>>>>>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 28, 2014 at 9:48 PM, German Carrillo <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>>
>>>>>>> I'm looking for a way to remove stop words from tokens returned by a
>>>>>>> keyword tokenizer, i.e., I'd like to obtain the original text without
>>>>>>> stop words after the analysis process.
>>>>>>>
>>>>>>> Sample data looks like: "El corregimiento de
>>>>>>> Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)"
>>>>>>> After the lowercase token filter: "el corregimiento de
>>>>>>> mulaló, jurisdicción del municipio de yumbo (valle del cauca)"
>>>>>>> After the ascii folding token filter: "el corregimiento de
>>>>>>> mulalo, jurisdiccion del municipio de yumbo (valle del cauca)"
>>>>>>> After removing stop words: "corregimiento mulalo,
>>>>>>> municipio yumbo (valle cauca)"
>>>>>>>
>>>>>>> The stop words (currently) are: ["la", "el", "de", "del",
>>>>>>> "los", "las", "jurisdiccion"]
>>>>>>>
>>>>>>> Is the pattern replace token filter the only (or best) way to go for
>>>>>>> such a task?
>>>>>>>
>>>>>>> I'd really like to avoid writing custom regular expressions rather
>>>>>>> than specifying a stop words list, which I know would work perfectly
>>>>>>> fine for other tokenizers.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Germán
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "elasticsearch" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com
>>>>>>> .
>>>>>>> For more options, visit https://groups.google.com/d/optout.
