Re: tokenization help mixed n-grams

Masaru Hasegawa Thu, 26 Feb 2015 23:57:07 -0800

Hi,

You can use a mapping char filter to remove whitespace and then an ngram 
tokenizer with min_gram=2/max_gram=<whatever you like> to produce the n-grams.
(Not sure whether you’d like to omit “bc”, “bcd”… or not, though.)
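
To make the difference concrete, here is a rough plain-Python sketch (not 
Elasticsearch code) that simulates both behaviours on the example string. The 
function names and max_gram=6 are assumptions made up for illustration. The 
first function mimics "strip whitespace, then ngram"; the second keeps only 
the n-grams that begin where an original token began, which seems to be what 
the question asks for.

```python
def char_ngrams(text, min_gram, max_gram):
    """All character n-grams of text with min_gram <= length <= max_gram."""
    return [
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(text)
    ]


def strip_then_ngram(text, min_gram=2, max_gram=6):
    """Simulates removing whitespace first, then n-gramming the whole string."""
    stripped = "".join(text.split())
    return char_ngrams(stripped, min_gram, max_gram)


def token_start_ngrams(text, min_gram=2, max_gram=6):
    """Keeps only n-grams that start at the first character of an original token."""
    tokens = text.split()
    joined = "".join(tokens)
    grams = []
    start = 0  # offset of the current token inside the joined string
    for token in tokens:
        for n in range(min_gram, max_gram + 1):
            if start + n <= len(joined):
                grams.append(joined[start:start + n])
        start += len(token)
    return grams


if __name__ == "__main__":
    print(strip_then_ngram("ab c dd c"))    # includes "bc", "bcd", ...
    print(token_start_ngrams("ab c dd c"))  # only token-anchored n-grams
```

The second list is a subset of the first, so the char filter plus ngram 
approach indexes everything required plus some extra grams such as “bc”.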



Masaru


On February 26, 2015 at 21:46:42, Ilija Subasic wrote:
> Hi,
> I am trying to create a tokenizer that is going to create tokens
> looking something like this:
>  
> "ab c dd c" would be tokenized as "ab", "abc", "abcd", "abcdd", "abcddc",
> "cd", "cdd", "cddc", "dd", "ddc"
>  
> so basically I need something that is going to do n-gram indexing from
> the start of each token. This is different
> from edge n-gram, which will tokenize each token separately.
> Any ideas on how to do this without coding a specific tokenizer?
>  
> Thanks,
> Ilija
>  
> --
> You received this message because you are subscribed to the Google Groups 
> "elasticsearch"  
> group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].  
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/elasticsearch/ec72da21-ed99-4cc8-829c-058467c020a5%40googlegroups.com.
>   
> For more options, visit https://groups.google.com/d/optout.
>  

