Re: Stop words and Keyword tokenizer

Ivan Brusic Thu, 28 Aug 2014 15:13:16 -0700

Character filters are executed before the tokenizer, so only something in
that family of filters would work if you plan to continue using the keyword
tokenizer.


http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
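To make the ordering concrete, here is a minimal Python sketch that imitates the analysis chain outside Elasticsearch. A regex-based character filter strips the stop words from the raw string, a keyword-style tokenizer then emits the whole string as one token, and lowercasing plus accent folding run last. This is only a simulation: the names `char_filter`, `keyword_tokenizer`, `fold` and `analyze` are made up for illustration and are not Elasticsearch APIs. Note that because the character filter sees the raw text, the pattern has to be case-insensitive and must list the accented spelling of "jurisdicción".

```python
import re
import unicodedata

# Stop words from the original post, plus the accented spelling, because the
# character filter runs before any ASCII folding takes place.
STOP_WORDS = ["la", "el", "de", "del", "los", "las", "jurisdiccion", "jurisdicción"]

# One alternation guarded by word boundaries; the whitespace that follows a
# stop word is swallowed too, so no double spaces are left behind.
STOP_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(unicodedata.normalize("NFC", w)) for w in STOP_WORDS)
    + r")\b\s*",
    re.IGNORECASE,
)


def char_filter(text):
    """Stage 1: rewrite the raw character stream (stop words are removed here)."""
    text = unicodedata.normalize("NFC", text)
    return STOP_PATTERN.sub("", text).strip()


def keyword_tokenizer(text):
    """Stage 2: emit the entire input as a single token."""
    return [text]


def fold(token):
    """Stage 3: lowercase, then drop combining accents (a rough ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def analyze(text):
    """Run the three stages in the order an analyzer would apply them."""
    return [fold(tok) for tok in keyword_tokenizer(char_filter(text))]


sample = ("El corregimiento de Mulaló, jurisdicción del municipio de Yumbo "
          "(Valle del Cauca)")
tokens = analyze(sample)
print(tokens)
```

For the sample phrase from the original post this yields the single token "corregimiento mulalo, municipio yumbo (valle cauca)", which is the target string given there.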

The mapping char filter might be a better match if your list is not in regex
form. I use the mapping char filter to remove copyright, trademark and a
whole list of other characters from my content.
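As a rough illustration of what such a literal mapping does, the sketch below imitates it in plain Python. At each position the longest matching key wins, which is how the documentation linked above describes the matching. The function name `apply_mappings` and the sample mappings are assumptions made for this example, not part of Elasticsearch, and the real filter is configured in the index settings rather than called like this.

```python
def apply_mappings(text, mappings):
    """Replace literal keys in text; the longest key matching at a position wins."""
    # Skip empty keys (they would never advance) and try longer keys first.
    keys = sorted((k for k in mappings if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mappings[key])
                i += len(key)
                break
        else:
            # No key matches here: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical mappings: each symbol is replaced by the empty string.
SYMBOLS = {"©": "", "™": "", "®": "", "(tm)": ""}

cleaned = apply_mappings("Acme(tm) Widgets®", SYMBOLS)
print(cleaned)
```

One caveat if the same idea is applied to stop words: the matching is on raw substrings with no notion of word boundaries, so a key such as "la" would also hit the middle of "Mulaló". The keys would have to carry their surrounding spaces, or the regex-based approach is the safer choice.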

Cheers,

Ivan


On Thu, Aug 28, 2014 at 2:33 PM, Germán Carrillo <[email protected]>
wrote:

> Ivan, yes, I'm aware I would obtain another text, that's fine. Even more,
> my docs have a "display" field to be returned to users after a search. For
> the example given above, the display value would be something like:
> "Mulaló, Yumbo, Valle del Cauca."
>
> Itamar, I've actually considered several options. I think a synonym file
> would be too big. I gave you 11 equivalent terms (you might've noticed I
> could have continued to give you around 30 equivalent ways), but I didn't
> mention place names (alone) have their corresponding synonyms, alternate
> names, abbreviations, and vernacular names. There could be 10k different
> places (docs) in the index. :D  Also, trying to cover every single case
> in the synonym file seems sub-optimal. Really, I intend to
> normalize a large number of ways of expressing place hierarchy into a few
> ways. Otherwise I'd have to build very large lists for each place I add to
> the index, and nothing prevents me from missing a weird case. BTW, handling
> hierarchy is a must, otherwise result disambiguation would be a nightmare
> for users.
>
> Thanks for all the discussion, it's certainly valuable to read an expert's
> opinion.
>
> Back to my very first question, is the pattern replace token filter the
> only way to remove stop words from tokens obtained from a keyword tokenizer?
> Are those regular expressions not very performant?
>
>
> 2014-08-28 15:49 GMT-05:00 Ivan Brusic <[email protected]>:
>
>> You mentioned in your original post "I'd like to obtain the original
>> text without stop words"
>>
>> The stopword-less phrase will indeed be present in the index after the
>> analysis phase; however, when you ask for this content back as a result of
>> a query, the original text will be returned. What is indexed is not
>> necessarily what is stored/returned.
>>
>> Cheers,
>>
>> Ivan
>>
>>
>> On Thu, Aug 28, 2014 at 12:30 PM, Germán Carrillo <
>> [email protected]> wrote:
>>
>>> Thanks Ivan,
>>>
>>> do you mean what I obtain from a request such as
>>>
>>> curl -XGET
>>> 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase,my_ascii_folding,my_stopwords'
>>> -d 'El corregimiento de Mulaló, jurisdicción del municipio de Yumbo
>>> (Valle del Cauca)'
>>>
>>> is not what will be present in the index after the analysis process? If
>>> so, how could I check whether the stop words filter is being (will be)
>>> applied to a sample phrase?
>>>
>>>
>>> 2014-08-28 14:03 GMT-05:00 Ivan Brusic <[email protected]>:
>>>
>>>>  Also note that the content returned will still contain the stop
>>>> words. Only the inverted index will contain the stopword-less content.
>>>>
>>>> --
>>>> Ivan
>>>>
>>>>
>>>> On Thu, Aug 28, 2014 at 11:55 AM, Itamar Syn-Hershko <
>>>> [email protected]> wrote:
>>>>
>>>>> What would be the use case for such a process (removing stop words
>>>>> without tokenization)?
>>>>>
>>>>> This may be a good read btw:
>>>>> http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/
>>>>>
>>>>> --
>>>>>
>>>>> Itamar Syn-Hershko
>>>>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>>>>> Freelance Developer & Consultant
>>>>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>>>>
>>>>>
>>>>> On Thu, Aug 28, 2014 at 9:48 PM, German Carrillo <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>>
>>>>>> I'm looking for a way to remove stop words from tokens returned by a
>>>>>> keyword tokenizer, i.e., I'd like to obtain the original text without
>>>>>> stop words after the analysis process.
>>>>>>
>>>>>> Sample data looks like:
>>>>>>   "El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)"
>>>>>> After the lowercase token filter:
>>>>>>   "el corregimiento de mulaló, jurisdicción del municipio de yumbo (valle del cauca)"
>>>>>> After the ascii folding token filter:
>>>>>>   "el corregimiento de mulalo, jurisdiccion del municipio de yumbo (valle del cauca)"
>>>>>> After removing stop words:
>>>>>>   "corregimiento mulalo, municipio yumbo (valle cauca)"
>>>>>>
>>>>>> The stop words (currently) are:
>>>>>>   ["la", "el", "de", "del", "los", "las", "jurisdiccion"]
>>>>>>
>>>>>> Is the pattern replace token filter the only (or best) way to go for
>>>>>> such a task?
>>>>>>
>>>>>> I'd really like to avoid writing custom regular expressions rather
>>>>>> than specifying a stop words list, which I know would work perfectly fine
>>>>>> for other tokenizers.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Germán
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "elasticsearch" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>

