The whitespace tokenizer has the drawback that punctuation is not stripped. In my experience, the word_delimiter filter does not work at all with the whitespace tokenizer, only with the keyword tokenizer, and then only with massive pattern matching, which is complex and expensive :(
Therefore I took the classic tokenizer and generalized the hyphen rules in its grammar. The "hyphen" tokenizer and the "hyphen" filter are two separate routines. The "hyphen" tokenizer keeps hyphenated words together and handles punctuation correctly. The "hyphen" filter adds combinations of the dehyphenated parts to the original form, so that these forms can be searched as well. Single words are only taken into account when they are positioned at the edge.

For example, the phrase "der-die-das" should be indexed in the following forms: "der-die-das", "derdiedas", "das", "derdie", "derdie-das", "die-das", "der"

Jörg

On Thu, Nov 20, 2014 at 9:29 AM, horst knete <[email protected]> wrote:
>
> So the term "this-is-a-test" gets tokenized into "this-is-a-test", which is
> nice behaviour, but in order to make a "full-text search" on this field it
> should get tokenized into "this-is-a-test", "this", "is", "a" and "test", as
> I wrote before.
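For illustration, here is a toy Python sketch of one way to generate that combination set from a hyphenated token. This is only my reading of the rule implied by the "der-die-das" example above, not the actual plugin code; the function name and the exact enumeration are assumptions.

```python
def hyphen_combinations(token):
    """Toy reconstruction of the 'hyphen' filter's output for one token.

    Assumed rule (inferred from the example, not from the plugin source):
    - keep the original hyphenated form and the fully joined form
    - add single words only from the edges
    - join a prefix of two or more parts, with and without the
      remaining parts kept hyphenated
    - drop leading parts while keeping the rest hyphenated
    """
    parts = token.split("-")
    n = len(parts)
    forms = {token, "".join(parts)}
    if n >= 2:
        # single words only at the edges
        forms.add(parts[0])
        forms.add(parts[-1])
    for i in range(2, n):
        # joined prefix, optionally followed by the hyphenated tail
        forms.add("".join(parts[:i]))
        forms.add("".join(parts[:i]) + "-" + "-".join(parts[i:]))
    for j in range(1, n - 1):
        # hyphenated suffix with leading parts dropped
        forms.add("-".join(parts[j:]))
    return forms

# hyphen_combinations("der-die-das") yields exactly the seven forms
# listed above: der-die-das, derdiedas, der, das, derdie,
# derdie-das, die-das.
```

Note that under the edge-word rule, a two-part token such as "this-is" would only yield "this-is", "thisis", "this", and "is"; inner words of longer tokens are deliberately not emitted on their own.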
