Hi,

thanks for the response and this awesome plugin bundle (especially useful 
for me as a German).

Unfortunately, the hyphen analyzer plugin does not do the job the way I 
want it to.

The "hyphen-analyzer" behaves much like the whitespace analyzer: it simply 
does not split on hyphens and instead treats them as ALPHANUM characters 
(at least that is my current understanding).

So the term "this-is-a-test" gets tokenized into "this-is-a-test", which is 
nice behaviour, but in order to allow a full-text search on this field it 
should get tokenized into "this-is-a-test", "this", "is", "a" and "test", as 
I wrote before.
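
To make the desired behaviour concrete, here is a minimal sketch in plain 
Python (not Elasticsearch code, just an illustration of the tokenization I 
am after):

```python
# Sketch of the desired tokenization: keep the original hyphenated
# term as one token and also emit each hyphen-separated part.
def tokenize_with_original(term):
    parts = term.split("-")
    if len(parts) == 1:
        return [term]          # no hyphen: the term itself is the only token
    return [term] + parts      # original first, then the sub-tokens

print(tokenize_with_original("this-is-a-test"))
```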

I think abusing the word_delimiter token filter might do the job, since it 
has a "preserve_original" option.

Unfortunately, if you configure the filter like this:

PUT /logstash-2014.11.20
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "wordtest" : {
                    "type" : "custom",
                    "tokenizer" : "whitespace",
                    "filter" : [
                        "lowercase",
                        "word"
                    ]
                }
            },
            "filter" : {
                "word" : {
                    "type" : "word_delimiter",
                    "generate_word_parts": false,
                    "generate_number_parts": false,
                    "catenate_words": false,
                    "catenate_numbers": false,
                    "catenate_all": false,
                    "split_on_case_change": false,
                    "preserve_original": true,
                    "split_on_numerics": false,
                    "stem_english_possessive": true
                }
            }
        }
    }
}

and run an analyze test:

curl -XGET 'localhost:9200/logstash-2014.11.20/_analyze?filters=word' -d 
'this-is-a-test'

the response is this:
{"tokens":[
  {"token":"this","start_offset":0,"end_offset":4,"type":"<ALPHANUM>","position":1},
  {"token":"is","start_offset":5,"end_offset":7,"type":"<ALPHANUM>","position":2},
  {"token":"a","start_offset":8,"end_offset":9,"type":"<ALPHANUM>","position":3},
  {"token":"test","start_offset":10,"end_offset":14,"type":"<ALPHANUM>","position":4}
]}

which shows it tokenized everything except the original term, and makes me 
wonder whether the preserve_original setting is working at all.
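
One thing I am not sure about: the _analyze call above does not specify a 
tokenizer, so if I understand the API correctly it falls back to the 
standard tokenizer, which already splits on hyphens before the 
word_delimiter filter ever sees the whole token (the "<ALPHANUM>" types in 
the response hint at that). Maybe testing with the whitespace tokenizer, or 
with the custom analyzer itself, would show the preserve_original behaviour 
(untested guesses on my side):

```shell
# Guess 1: force the whitespace tokenizer so the filter receives
# the unsplit token "this-is-a-test".
curl -XGET 'localhost:9200/logstash-2014.11.20/_analyze?tokenizer=whitespace&filters=word' -d 'this-is-a-test'

# Guess 2: run the whole custom analyzer defined above instead of
# assembling tokenizer and filter by hand.
curl -XGET 'localhost:9200/logstash-2014.11.20/_analyze?analyzer=wordtest' -d 'this-is-a-test'
```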

Any idea on this?

On Wednesday, November 19, 2014 at 6:26:09 PM UTC+1, Jörg Prante wrote:
>
> You search for a hyphen-aware tokenizer, like this?
>
> https://gist.github.com/jprante/cd120eac542ba6eec965
>
> It is in my plugin bundle
>
> https://github.com/jprante/elasticsearch-plugin-bundle
>
> Jörg
>
> On Wed, Nov 19, 2014 at 5:46 PM, horst knete <[email protected]> wrote:
>
>> Hey guys,
>>
>> after working with the ELK stack for a while now, we still have a very 
>> annoying problem with the behavior of the standard analyzer: it 
>> splits terms into tokens using hyphens or dots as delimiters.
>>
>> E.g. logsource:firewall-physical-management gets split into "firewall", 
>> "physical" and "management". On one side that's cool, because if you search 
>> for logsource:firewall you get all the events with firewall as a token in 
>> the field logsource. 
>>
>> The downside of this behaviour is that if you run e.g. a "top 10" query 
>> on a field in Kibana, each token is counted as a whole term and ranked 
>> by its count: 
>> top 10: 
>> 1. firewall : 10
>> 2. physical : 10
>> 3. management: 10
>>
>> instead of top 10:
>> 1. firewall-physical-management: 10
>>
>> Well, in the standard Logstash mapping this is solved with a .raw field 
>> that is "not_analyzed", but the downside is that you get 2 fields instead 
>> of one (even if it is a multi_field), and the usability for Kibana users 
>> is not that great.
>>
>> So what we need is that logsource:firewall-physical-management gets 
>> tokenized into "firewall-physical-management", "firewall", "physical" and 
>> "management".
>>
>> I tried this using the word_delimiter token filter with the following 
>> mapping:
>>
>>  "analysis" : {
>>      "analyzer" : {
>>          "my_analyzer" : {
>>              "type" : "custom",
>>              "tokenizer" : "whitespace",
>>              "filter" : ["lowercase", "asciifolding", "my_worddelimiter"]
>>          }
>>      },
>>      "filter" : {
>>          "my_worddelimiter" : {
>>              "type" : "word_delimiter",
>>              "generate_word_parts": false,
>>              "generate_number_parts": false,
>>              "catenate_words": false,
>>              "catenate_numbers": false,
>>              "catenate_all": false,
>>              "split_on_case_change": false,
>>              "preserve_original": true,
>>              "split_on_numerics": false,
>>              "stem_english_possessive": true
>>          }
>>      }
>>  }
>>
>> But this unfortunately didn't do the job.
>>
>> I've seen in my research that some other people have a similar problem, 
>> but apart from some workaround suggestions, no real solution was found.
>>
>> If anyone has ideas on how to start working on this, I would be very 
>> happy.
>>
>> thanks.
>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/4094292c-057f-43d8-9af0-1ea83ad45a1c%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/elasticsearch/4094292c-057f-43d8-9af0-1ea83ad45a1c%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/64ac834c-3593-490d-8fe9-9a12404a98f1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
