Hi,
thanks for the response and this awesome plugin bundle (especially for me as
a German).
Unfortunately, the hyphen analyzer plugin didn't do the job the way I wanted
it to.
The "hyphen-analyzer" does something similar to the whitespace analyzer: it
just doesn't split on hyphens and instead treats them as ALPHANUM characters
(at least that is what I think right now).
So the term "this-is-a-test" gets tokenized into "this-is-a-test", which is
nice behaviour, but in order to run a full-text search on this field it
should get tokenized into "this-is-a-test", "this", "is", "a" and "test", as
I wrote before.
I think maybe abusing the word_delimiter token filter could do the job,
because there is a "preserve_original" option.
Unfortunately, if you configure the filter like this:
PUT /logstash-2014.11.20
{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "wordtest" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : [
            "lowercase",
            "word"
          ]
        }
      },
      "filter" : {
        "word" : {
          "type" : "word_delimiter",
          "generate_word_parts": false,
          "generate_number_parts": false,
          "catenate_words": false,
          "catenate_numbers": false,
          "catenate_all": false,
          "split_on_case_change": false,
          "preserve_original": true,
          "split_on_numerics": false,
          "stem_english_possessive": true
        }
      }
    }
  }
}
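As a side note, my reading of the word_delimiter documentation is that with all the generate_* and catenate_* options disabled, the filter can emit at most the original token, so the sub-tokens are lost either way. To get the original term plus its parts, generate_word_parts presumably has to be true in addition to preserve_original. A sketch of the filter definition under that assumption:

```json
"filter" : {
  "word" : {
    "type" : "word_delimiter",
    "generate_word_parts": true,
    "preserve_original": true,
    "split_on_case_change": false,
    "split_on_numerics": false,
    "stem_english_possessive": true
  }
}
```

With the whitespace tokenizer in front, this should turn "this-is-a-test" into "this-is-a-test", "this", "is", "a" and "test", if I understand the filter correctly.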
and run an analyze test:

curl -XGET 'localhost:9200/logstash-2014.11.20/_analyze?filters=word' -d 'this-is-a-test'
the response is this:
{"tokens":[
  {"token":"this","start_offset":0,"end_offset":4,"type":"<ALPHANUM>","position":1},
  {"token":"is","start_offset":5,"end_offset":7,"type":"<ALPHANUM>","position":2},
  {"token":"a","start_offset":8,"end_offset":9,"type":"<ALPHANUM>","position":3},
  {"token":"test","start_offset":10,"end_offset":14,"type":"<ALPHANUM>","position":4}
]}
which says it tokenized the input into everything except the original term,
which makes me wonder whether the preserve_original setting is working at all.
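One thing I noticed afterwards: the _analyze call above passes only ?filters=word and no tokenizer, so Elasticsearch presumably falls back to its default standard tokenizer, which already splits "this-is-a-test" on the hyphens before word_delimiter ever sees the whole token. That would explain why no original term shows up. A test that goes through the custom analyzer (and therefore the whitespace tokenizer) might look like this, assuming the wordtest analyzer from the settings above:

```shell
# Run _analyze through the custom analyzer defined in the index settings,
# so the whitespace tokenizer feeds the whole term to word_delimiter.
curl -XGET 'localhost:9200/logstash-2014.11.20/_analyze?analyzer=wordtest' -d 'this-is-a-test'
```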
Any idea on this?
On Wednesday, 19 November 2014 18:26:09 UTC+1, Jörg Prante wrote:
>
> Are you looking for a hyphen-aware tokenizer, like this?
>
> https://gist.github.com/jprante/cd120eac542ba6eec965
>
> It is in my plugin bundle
>
> https://github.com/jprante/elasticsearch-plugin-bundle
>
> Jörg
>
> On Wed, Nov 19, 2014 at 5:46 PM, horst knete <[email protected]> wrote:
>
>> Hey guys,
>>
>> after working with the ELK stack for a while now, we still have a very
>> annoying problem regarding the behaviour of the standard analyzer: it
>> splits terms into tokens using hyphens or dots as delimiters.
>>
>> E.g. logsource:firewall-physical-management gets split into "firewall",
>> "physical" and "management". On one side that's cool, because if you search
>> for logsource:firewall you get all the events with "firewall" as a token in
>> the logsource field.
>>
>> The downside of this behaviour is that if you do e.g. a "top 10" query on
>> a field in Kibana, each token is counted as a whole term and ranked by its
>> count:
>> top 10:
>> 1. firewall : 10
>> 2. physical : 10
>> 3. management: 10
>>
>> instead of top 10:
>> 1. firewall-physical-management: 10
>>
>> Well, in the standard Logstash mapping this is solved with a .raw
>> subfield that is "not_analyzed", but the downside of that is you get two
>> fields instead of one (even if it is a multi_field), and the usability for
>> Kibana users is not that great.
>>
>> So what we need is for logsource:firewall-physical-management to get
>> tokenized into "firewall-physical-management", "firewall", "physical" and
>> "management".
>>
>> I tried this using the word_delimiter token filter with the following
>> mapping:
>>
>> "analysis" : {
>>   "analyzer" : {
>>     "my_analyzer" : {
>>       "type" : "custom",
>>       "tokenizer" : "whitespace",
>>       "filter" : ["lowercase", "asciifolding", "my_worddelimiter"]
>>     }
>>   },
>>   "filter" : {
>>     "my_worddelimiter" : {
>>       "type" : "word_delimiter",
>>       "generate_word_parts": false,
>>       "generate_number_parts": false,
>>       "catenate_words": false,
>>       "catenate_numbers": false,
>>       "catenate_all": false,
>>       "split_on_case_change": false,
>>       "preserve_original": true,
>>       "split_on_numerics": false,
>>       "stem_english_possessive": true
>>     }
>>   }
>> }
>>
>> But unfortunately this didn't do the job.
>>
>> I've seen during my research that some other people have a similar
>> problem, but apart from some replacement suggestions, no real solution was
>> found.
>>
>> If anyone has ideas on how to start working on this, I would be very
>> happy.
>>
>> Thanks.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/4094292c-057f-43d8-9af0-1ea83ad45a1c%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/64ac834c-3593-490d-8fe9-9a12404a98f1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.