Re: Stop words and Keyword tokenizer

Germán Carrillo Thu, 28 Aug 2014 14:34:37 -0700

Ivan, yes, I'm aware the text returned would differ from the text indexed,
and that's fine. In fact, my docs have a "display" field that is returned to
users after a search. For the example given above, the display value would
be something like: "Mulaló, Yumbo, Valle del Cauca."


Itamar, I've actually considered several options. I think a synonym file
would be too big. I gave you 11 equivalent terms (you might've noticed I
could have gone on to around 30 equivalent forms), but I didn't mention
that place names on their own also have synonyms, alternate names,
abbreviations, and vernacular names. There could be 10k different places
(docs) in the index. :D  Also, trying to cover every single case in the
synonym file seems sub-optimal. What I really intend is to normalize a
large number of ways of expressing place hierarchy into a few ways.
Otherwise I'd have to build very large lists for each place I add to the
index, and nothing guarantees I won't miss a weird case. BTW, handling
hierarchy is a must, otherwise result disambiguation would be a nightmare
for users.

Thanks for all the discussion, it's certainly valuable to read an expert's
opinion.

Back to my very first question: is the pattern replace token filter the
only way to remove stop words from the single token produced by a keyword
tokenizer? And aren't regular expressions like that likely to perform poorly?
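
For illustration only, here is a minimal sketch of the regex approach being
asked about, written in Python rather than as Elasticsearch settings. It
assumes the input has already been lowercased and ASCII-folded, and it uses
the stop word list from the original post. The function names
(build_stop_pattern, strip_stop_words) are hypothetical and not part of any
Elasticsearch API. A similar expression could presumably be given as the
"pattern" of a pattern_replace token filter with an empty "replacement", but
that is an assumption and is not tested here.

```python
import re

# Stop words taken from the original post in this thread.
STOP_WORDS = ["la", "el", "de", "del", "los", "las", "jurisdiccion"]


def build_stop_pattern(stop_words):
    """Compile one regex that matches any whole stop word plus trailing spaces."""
    # Escape each word so it is matched literally; the \b anchors make sure
    # "la" does not match inside "mulalo" and "de" does not match inside "del".
    alternatives = "|".join(re.escape(word) for word in stop_words)
    return re.compile(r"\b(?:" + alternatives + r")\b\s*")


def strip_stop_words(token, pattern):
    """Remove stop words from a single, untokenized string."""
    cleaned = pattern.sub("", token)
    # Collapse any leftover runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    pattern = build_stop_pattern(STOP_WORDS)
    text = ("el corregimiento de mulalo, jurisdiccion del municipio "
            "de yumbo (valle del cauca)")
    print(strip_stop_words(text, pattern))
```

The pattern is compiled once and reused for every value, so the cost per
value is a single pass of one alternation over the string. Whether that is
fast enough at index time would need measuring on real data.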


2014-08-28 15:49 GMT-05:00 Ivan Brusic <[email protected]>:

> You mentioned in your original post "I'd like to obtain the original text
> without stop words"
>
> The stopword-less phrase will indeed be present in the index after the
> analysis phase; however, when you ask for this content back as a result of
> a query, the original text will be returned. What is indexed is not
> necessarily what is stored/returned.
>
> Cheers,
>
> Ivan
>
>
> On Thu, Aug 28, 2014 at 12:30 PM, Germán Carrillo <
> [email protected]> wrote:
>
>> Thanks Ivan,
>>
>> do you mean what I obtain from a request such as
>>
>> curl -XGET
>> 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase,my_ascii_folding,my_stopwords'
>> -d 'El corregimiento de Mulaló, jurisdicción del municipio de Yumbo
>> (Valle del Cauca)'
>>
>> is not what will be present in the index after the analysis process? If
>> so, how could I check whether the stop words filter is being (will be)
>> applied to a sample phrase?
>>
>>
>> 2014-08-28 14:03 GMT-05:00 Ivan Brusic <[email protected]>:
>>
>>>  Also note that the content returned will still contain the stop words.
>>> Only the inverted index will contain the stopword-less content.
>>>
>>> --
>>> Ivan
>>>
>>>
>>> On Thu, Aug 28, 2014 at 11:55 AM, Itamar Syn-Hershko <[email protected]
>>> > wrote:
>>>
>>>> What would be the usecase for such a process (removing stop words
>>>> without tokenization)?
>>>>
>>>> This may be a good read btw:
>>>> http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/
>>>>
>>>> --
>>>>
>>>> Itamar Syn-Hershko
>>>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>>>> Freelance Developer & Consultant
>>>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>>>
>>>>
>>>> On Thu, Aug 28, 2014 at 9:48 PM, German Carrillo <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>>
>>>>> I'm looking for a way to remove stop words from tokens returned by a
>>>>> keyword tokenizer, i.e., I'd like to obtain the original text without stop
>>>>> words after the analysis process.
>>>>>
>>>>> Sample data looks like:
>>>>>   "El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)"
>>>>> After the lowercase token filter:
>>>>>   "el corregimiento de mulaló, jurisdicción del municipio de yumbo (valle del cauca)"
>>>>> After the ascii folding token filter:
>>>>>   "el corregimiento de mulalo, jurisdiccion del municipio de yumbo (valle del cauca)"
>>>>> After removing stop words:
>>>>>   "corregimiento mulalo, municipio yumbo (valle cauca)"
>>>>>
>>>>> The stop words (currently) are:      ["la", "el", "de", "del", "los",
>>>>> "las", "jurisdiccion"]
>>>>>
>>>>> Is the pattern replace token filter the only (or best) way to go for
>>>>> such a task?
>>>>>
>>>>> I'd much rather specify a stop words list than write custom regular
>>>>> expressions; I know a list would work perfectly fine with other
>>>>> tokenizers.
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Germán
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

