Thanks Ivan! I'll test which approach fits my needs better.
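
For anyone comparing the two suggestions, here is a minimal sketch in plain Python (not Elasticsearch configuration) of what each amounts to when the whole text stays a single token. The helper names strip_stop_words and strip_symbols and the SYMBOL_MAP table are hypothetical, made up for illustration. The first mimics a regex-style replacement built from the stop words list, the second a literal character mapping like the one Ivan describes. It assumes the text was already lowercased and ASCII-folded.

```python
import re

# Stop words from the original post ("jurisdiccion" is already ASCII-folded).
STOP_WORDS = ["la", "el", "de", "del", "los", "las", "jurisdiccion"]

# Build one alternation with word boundaries so that "el" does not match
# inside "del" and "la" does not match inside "mulalo". Longest words go
# first, which keeps the alternation easy to reason about.
_ALTERNATION = "|".join(
    re.escape(w) for w in sorted(STOP_WORDS, key=len, reverse=True)
)
STOP_WORDS_RE = re.compile(r"\b(?:" + _ALTERNATION + r")\b\s*")


def strip_stop_words(text: str) -> str:
    """Regex-style approach: delete whole-word stop words from one string."""
    return STOP_WORDS_RE.sub("", text).strip()


# Hypothetical literal mapping: each listed character is replaced by nothing.
SYMBOL_MAP = {"©": "", "™": "", "®": ""}


def strip_symbols(text: str) -> str:
    """Mapping-style approach: literal character replacements, no regex."""
    return text.translate(str.maketrans(SYMBOL_MAP))


folded = ("el corregimiento de mulalo, jurisdiccion del municipio "
          "de yumbo (valle del cauca)")
print(strip_stop_words(folded))
# prints: corregimiento mulalo, municipio yumbo (valle cauca)
```

The regex is generated from the list, so the stop words stay in list form even though the replacement itself is pattern based. Whether this is fast enough inside Elasticsearch is something only a test on real data can tell.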


2014-08-28 17:12 GMT-05:00 Ivan Brusic:

> Character filters are executed before the tokenizer, so only something in
> that family of filters would work if you plan to continue using the keyword
> tokenizer.
>
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
>
> The mapping char filter might be a better match if your list is not in
> regex form. I use the mapping char filter to remove copyright, trademark,
> and a whole list of other characters from my content.
>
> Cheers,
>
> Ivan
>
>
> On Thu, Aug 28, 2014 at 2:33 PM, Germán Carrillo wrote:
>
>> Ivan, yes, I'm aware I would obtain a different text; that's fine. What's
>> more, my docs have a "display" field to be returned to users after a
>> search. For the example given above, the display value would be something
>> like: "Mulaló, Yumbo, Valle del Cauca."
>>
>> Itamar, I've actually considered several options. I think a synonym file
>> would be too big. I gave you 11 equivalent terms (you might've noticed I
>> could have gone on to around 30 equivalent forms), but I didn't mention
>> that place names alone have their own synonyms, alternate names,
>> abbreviations, and vernacular names. There could be 10k different places
>> (docs) in the index. :D  Also, covering every single case in the synonym
>> file seems sub-optimal. What I really intend is to normalize a large
>> number of ways of expressing place hierarchy into a few. Otherwise I'd
>> have to build very large lists for each place I add to the index, and
>> nothing prevents me from missing a weird case. BTW, handling hierarchy is
>> a must; otherwise result disambiguation would be a nightmare for users.
>>
>> Thanks for all the discussion, it's certainly valuable to read an
>> expert's opinion.
>>
>> Back to my very first question: is the pattern replace token filter the
>> only way to remove stop words from tokens obtained from a keyword
>> tokenizer? And aren't those regular expressions a performance concern?
>>
>>
>> 2014-08-28 15:49 GMT-05:00 Ivan Brusic:
>>
>>>  You mentioned in your original post "I'd like to obtain the original
>>> text without stop words"
>>>
>>> The stopword-less phrase will indeed be present in the index after the
>>> analysis phase. However, when you ask for this content back as the result
>>> of a query, the original text will be returned. What is indexed is not
>>> necessarily what is stored/returned.
>>>
>>> Cheers,
>>>
>>> Ivan
>>>
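
To make the indexed-versus-stored distinction concrete, here is a toy model in plain Python. It is only a sketch under loud assumptions: ToyIndex and analyze are hypothetical names, the whitespace split used to drop stop words is a shortcut, and none of this is how Elasticsearch is implemented. It only shows that the analyzed term is what gets matched, while the untouched original is what comes back.

```python
import unicodedata

# Stop words from the original post.
STOP_WORDS = {"la", "el", "de", "del", "los", "las", "jurisdiccion"}


def analyze(text: str) -> str:
    """Toy chain: lowercase, strip accents, drop stop words, keep ONE term."""
    lowered = text.lower()
    decomposed = unicodedata.normalize("NFKD", lowered)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = [word for word in folded.split() if word not in STOP_WORDS]
    return " ".join(kept)


class ToyIndex:
    """Keeps the original text and the analyzed term in separate places."""

    def __init__(self):
        self.stored = {}    # doc id -> original text, returned to the user
        self.inverted = {}  # analyzed term -> set of doc ids, used to match

    def index(self, doc_id, text):
        self.stored[doc_id] = text
        self.inverted.setdefault(analyze(text), set()).add(doc_id)

    def search(self, query):
        ids = self.inverted.get(analyze(query), set())
        return [self.stored[i] for i in sorted(ids)]


original = ("El corregimiento de Mulaló, jurisdicción del municipio "
            "de Yumbo (Valle del Cauca)")
idx = ToyIndex()
idx.index(1, original)

# The query matches on the stopword-less term, yet the hit is the original.
hits = idx.search("corregimiento mulalo, municipio yumbo (valle cauca)")
print(hits[0] == original)  # prints: True
```

Inspecting idx.inverted in this toy plays the role that the _analyze request plays against a real cluster: it shows the term that would be matched, not the text that is returned.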
>>>
>>> On Thu, Aug 28, 2014 at 12:30 PM, Germán Carrillo wrote:
>>>
>>>> Thanks Ivan,
>>>>
>>>> do you mean what I obtain from a request such as
>>>>
>>>> curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase,my_ascii_folding,my_stopwords' \
>>>>   -d 'El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)'
>>>>
>>>> is not what will be present in the index after the analysis process? If
>>>> so, how could I check whether the stop words filter is being (will be)
>>>> applied to a sample phrase?
>>>>
>>>>
>>>> 2014-08-28 14:03 GMT-05:00 Ivan Brusic:
>>>>
>>>>>  Also note that the content returned will still contain the stop
>>>>> words. Only the inverted index will contain the stopword-less content.
>>>>>
>>>>> --
>>>>> Ivan
>>>>>
>>>>>
>>>>> On Thu, Aug 28, 2014 at 11:55 AM, Itamar Syn-Hershko wrote:
>>>>>
>>>>>> What would be the use case for such a process (removing stop words
>>>>>> without tokenization)?
>>>>>>
>>>>>> This may be a good read btw:
>>>>>> http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Itamar Syn-Hershko
>>>>>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>>>>>> Freelance Developer & Consultant
>>>>>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 28, 2014 at 9:48 PM, German Carrillo wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>>
>>>>>>> I'm looking for a way to remove stop words from tokens returned by a
>>>>>>> keyword tokenizer, i.e., I'd like to obtain the original text without
>>>>>>> stop words after the analysis process.
>>>>>>>
>>>>>>> Sample data:
>>>>>>>   "El corregimiento de Mulaló, jurisdicción del municipio de Yumbo (Valle del Cauca)"
>>>>>>> After the lowercase token filter:
>>>>>>>   "el corregimiento de mulaló, jurisdicción del municipio de yumbo (valle del cauca)"
>>>>>>> After the ascii folding token filter:
>>>>>>>   "el corregimiento de mulalo, jurisdiccion del municipio de yumbo (valle del cauca)"
>>>>>>> After removing stop words:
>>>>>>>   "corregimiento mulalo, municipio yumbo (valle cauca)"
>>>>>>>
>>>>>>> The stop words (currently) are: ["la", "el", "de", "del", "los", "las", "jurisdiccion"]
>>>>>>>
>>>>>>> Is the pattern replace token filter the only (or best) way to go for
>>>>>>> such a task?
>>>>>>>
>>>>>>> I'd really like to avoid writing custom regular expressions and
>>>>>>> instead specify a stop words list, which I know would work perfectly
>>>>>>> fine with other tokenizers.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Germán
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "elasticsearch" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/elasticsearch/038ff037-ccf3-4aca-b0c0-bb421531c495%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>

