The whitespace tokenizer has the problem that punctuation is not ignored. I
find that the word_delimiter filter does not work at all with the whitespace
tokenizer, only with the keyword tokenizer, and then only with massive
pattern matching, which is complex and expensive :(

Therefore I took the classic tokenizer and generalized the hyphen rules in
the grammar. The tokenizer "hyphen" and the filter "hyphen" are two separate
routines. The tokenizer "hyphen" keeps hyphenated words together and handles
punctuation correctly. The filter "hyphen" adds combinations of the
dehyphenated parts to the original form.
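Assuming the tokenizer and filter are both registered under the name "hyphen"
(the analyzer name "hyphenated" below is just illustrative), an index could
wire them together roughly like this:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "hyphenated": {
          "type": "custom",
          "tokenizer": "hyphen",
          "filter": ["lowercase", "hyphen"]
        }
      }
    }
  }
}
```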

The main point is to add combinations of the dehyphenated forms so that they
can be searched.

Single words are only taken into account when they are positioned at the
edge of the hyphenated compound.

For example, the phrase "der-die-das" should be indexed in the following
forms:

"der-die-das", "derdiedas", "das", "derdie", "derdie-das", "die-das", "der"
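One way to read this combination rule (my own sketch, not the plugin's actual
source): for every split point in the compound, emit the joined prefix, the
still-hyphenated suffix, and the two glued back together, plus the fully
joined form. For "der-die-das" this yields exactly the seven forms above:

```python
def hyphen_combinations(term, sep="-"):
    """Sketch of the hyphen-filter combination rule (an assumption,
    reverse-engineered from the "der-die-das" example, not the real code)."""
    parts = term.split(sep)
    forms = [term]  # the original hyphenated form is kept
    for i in range(1, len(parts)):
        prefix = "".join(parts[:i])   # joined prefix, e.g. "derdie"
        suffix = sep.join(parts[i:])  # hyphenated suffix, e.g. "das"
        forms.extend([prefix, suffix, prefix + sep + suffix])
    forms.append("".join(parts))      # fully joined form, e.g. "derdiedas"
    # de-duplicate while keeping insertion order
    return list(dict.fromkeys(forms))

print(hyphen_combinations("der-die-das"))
```

Note that this emits the single words "der" and "das" (at the edges) but
never the inner word "die" on its own, matching the edge rule above.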

Jörg

On Thu, Nov 20, 2014 at 9:29 AM, horst knete <[email protected]> wrote:

>
> So the term "this-is-a-test" get tokenized into "this-is-a-test" which is
> nice behaviour, but in order to make an "full-text-search" on this field it
> should get tokenized into "this-is-a-test", "this", "is", "a" and "test" as
> i wrote before.
>
>
