Re: Stop words and Keyword tokenizer

Ivan Brusic Thu, 28 Aug 2014 13:50:19 -0700

You mentioned in your original post "I'd like to obtain the original text
without stop words"


The stopword-less phrase will indeed be present in the index after the
analysis phase. However, when you ask for this content back as the result of
a query, the original text is returned. What is indexed is not
necessarily what is stored/returned.
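
To make the indexed-versus-stored distinction concrete, here is a minimal
plain-Python sketch. It does not call Elasticsearch: ToyIndex, analyze and
ascii_fold are hypothetical names invented for this illustration, and the
analysis chain splits on word characters (unlike the keyword tokenizer in
this thread) so that individual terms can be looked up. The stop word list
is the one from the original post.

```python
import re
import unicodedata

# Stop word list taken from the original post.
STOP_WORDS = {"la", "el", "de", "del", "los", "las", "jurisdiccion"}


def ascii_fold(text):
    # Rough stand-in for an ASCII folding filter: decompose, then drop accents.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    # Lowercase, fold accents, split into word tokens, drop stop words.
    tokens = re.findall(r"\w+", ascii_fold(text.lower()))
    return [t for t in tokens if t not in STOP_WORDS]


class ToyIndex:
    """Hypothetical toy index: analyzed terms and original text live apart."""

    def __init__(self):
        self.sources = {}   # doc id -> original text (what a query returns)
        self.postings = {}  # term -> set of doc ids (what a query matches on)

    def index(self, doc_id, text):
        self.sources[doc_id] = text
        for term in analyze(text):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, query):
        hits = set()
        for term in analyze(query):
            hits |= self.postings.get(term, set())
        # The stored original comes back, stop words and accents included.
        return [self.sources[doc_id] for doc_id in sorted(hits)]
```

Indexing the sample phrase and searching for "mulalo" returns the original
sentence unchanged, while the postings contain only the stopword-less terms.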

Cheers,

Ivan


On Thu, Aug 28, 2014 at 12:30 PM, Germán Carrillo <[email protected]> wrote:

> Thanks Ivan,
>
> do you mean what I obtain from a request such as
>
> curl -XGET
> 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase,my_ascii_folding,my_stopwords'
> -d 'El corregimiento de Mulaló, jurisdicción del municipio de Yumbo
> (Valle del Cauca)'
>
> is not what will be present in the index after the analysis process? If
> so, how could I check whether the stop words filter is being (will be)
> applied to a sample phrase?
>
>
> 2014-08-28 14:03 GMT-05:00 Ivan Brusic <[email protected]>:
>
>> Also note that the content returned will still contain the stop words.
>> Only the inverted index will contain the stopword-less content.
>>
>> --
>> Ivan
>>
>>
>> On Thu, Aug 28, 2014 at 11:55 AM, Itamar Syn-Hershko <[email protected]>
>> wrote:
>>
>>> What would be the use case for such a process (removing stop words
>>> without tokenization)?
>>>
>>> This may be a good read btw:
>>> http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/
>>>
>>> --
>>>
>>> Itamar Syn-Hershko
>>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>>> Freelance Developer & Consultant
>>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>>
>>>
>>> On Thu, Aug 28, 2014 at 9:48 PM, German Carrillo <
>>> [email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>>
>>>> I'm looking for a way to remove stop words from tokens returned by a
>>>> keyword tokenizer, i.e., I'd like to obtain the original text without stop
>>>> words after the analysis process.
>>>>
>>>> Sample data looks like:
>>>>   "El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)"
>>>> After the lowercase token filter:
>>>>   "el corregimiento de mulaló, jurisdicción del municipio de yumbo (valle del cauca)"
>>>> After the ascii folding token filter:
>>>>   "el corregimiento de mulalo, jurisdiccion del municipio de yumbo (valle del cauca)"
>>>> After removing stop words:
>>>>   "corregimiento mulalo, municipio yumbo (valle cauca)"
>>>>
>>>> The stop words (currently) are:      ["la", "el", "de", "del", "los",
>>>> "las", "jurisdiccion"]
>>>>
>>>> Is the pattern replace token filter the only (or best) way to go for
>>>> such a task?
>>>>
>>>> I'd much rather specify a stop words list, which I know works
>>>> perfectly well with other tokenizers, than write custom regular
>>>> expressions.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Germán
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>

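On the original question, a stop filter drops a token only when the whole
token equals a stop word, and a keyword tokenizer emits the entire input as
one token, so the two do not combine the way the original post hopes. The
plain-Python sketch below illustrates that, and also shows one way to
generate the regular expression from the stop word list instead of writing
it by hand. All function names are hypothetical. Whether the same
alternation behaves identically as the pattern of a pattern replace token
filter is an assumption and has not been tested against Elasticsearch.

```python
import re

# Stop word list taken from the original post.
STOP_WORDS = ["la", "el", "de", "del", "los", "las", "jurisdiccion"]


def keyword_tokenize(text):
    # A keyword tokenizer emits the entire input as a single token.
    return [text]


def stop_filter(tokens, stop_words):
    # A stop filter removes a token only if the whole token is a stop word.
    stop_set = set(stop_words)
    return [t for t in tokens if t not in stop_set]


def build_stop_pattern(stop_words):
    # Build one alternation from the list; \b restricts matches to whole
    # words, so the "la" inside "mulalo" is left alone.
    ordered = sorted(stop_words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b\s*")


def strip_stop_words(token, stop_words):
    # Remove stop words inside a single token, then tidy leftover whitespace.
    cleaned = build_stop_pattern(stop_words).sub("", token)
    return re.sub(r"\s+", " ", cleaned).strip()
```

With the lowercased, folded sample phrase as input, stop_filter leaves the
single keyword token untouched, whereas strip_stop_words yields the
"corregimiento mulalo, municipio yumbo (valle cauca)" form from the original
post.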
