Hi Ivan,

Using a test index and the analyze API, I was now able to create a config, which is fine for me... theoretically:

{
  "template": "logstash-*",
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "stop", "my_word_delimiter", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "properties": {
        "excp": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "my_analyzer"
        },
        "msg": {
          "type": "string",
          "index": "not_analyzed",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

The problem now is: as soon as I activate this for the two fields and a new logstash index has been created, I cannot use a simple_query_string query to retrieve any results. It won't find anything via the REST API. Using the standard logstash template and mapping, it works fine.
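For reference, this is the kind of search that comes back empty; index name, field, and query text here are only placeholders, not my real data:

curl -XGET 'localhost:9200/logstash-2014.08.29/_search?pretty=1' -d '{
  "query": {
    "simple_query_string": {
      "query": "onmessage",
      "fields": ["excp"]
    }
  }
}'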
"my_word_delimiter": { "type": "word_delimiter", "preserve_original": "true" } }, "analyzer": { "my_analyzer": { "type": "custom", "tokenizer": "standard", "filter": ["standard", "lowercase", "stop", "my_word_delimiter", "asciifolding"] } } } }, "mappings": { "_default_": { "properties": { "excp": { "type": "string", "index": "analyzed", "analyzer": "my_analyzer" }, "msg": { "type": "string", "index": "not_analyzed", "analyzer": "my_analyzer" } } } } } The problem now is, as soon as I activate this for the two fields and have a new logstash index created I cannot use a simpleQueryString query to retrieve any results. It won't find anything via the REST api. Using the standard logstash template and mapping it works fine. Have you observed anything simililar? Thx Marc On Friday, August 29, 2014 6:49:41 PM UTC+2, Ivan Brusic wrote: > > That output does not look like the something generated from the standard > analyzer since it contains uppercase letters and various non-word > characters such as '='. > > Your two analysis requests will differ since the second one contains the > default word_delimiter filter instead of your custom my_word_delimiter. > What you are trying to achieve is somewhat difficult, but you can get there > if you keep on tweaking. :) Try using a pattern tokenizer instead of the > whitespace tokenizer if you want more control over word boundaries. > > -- > Ivan > > > On Fri, Aug 29, 2014 at 1:48 AM, Marc <mn.o...@googlemail.com > <javascript:>> wrote: > > Hi Ivan, > > thanks again. I have tried so and found a reasonable combination. > Nevertheless, when I now try to use the analyze api with an index that has > the said analyzer defined via template it doesn't seem to apply: > > This is the complete template: > { > "template": "bogstash-*", > "settings": { > "index.number_of_replicas": 0, > "analysis": { > "analyzer": { > "msg_excp_analyzer": { > "type": "custom", > "tokenizer": "whitespace", > "filters": ["word_delimiter", > "lowercase", > "asciifolding", > "shingle", > "standard"] > } > }, > "filters": { > "my_word_delimiter": { > "type": "word_delimiter", > "preserve_original": "true" > }, > "my_asciifolding": { > "type": "asciifolding", > "preserve_original": true > } > } > } > }, > "mappings": { > "_default_": { > "properties": { > "@excp": { > "type": "string", > "index": "analyzed", > "analyzer": "msg_excp_analyzer" > }, > "@msg": { > "type": "string", > "index": "analyzed", > "analyzer": "msg_excp_analyzer" > } > } > } > } > } > I create the index bogstash-1. 
> > Now I test the following:
> > curl -XGET 'localhost:9200/bogstash-1/_analyze?analyzer=msg_excp_analyzer&pretty=1' -d 'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ )'
> > and it returns:
> > {
> >   "tokens" : [ {
> >     "token" : "Service=MyMDB.onMessage",
> >     "start_offset" : 0,
> >     "end_offset" : 23,
> >     "type" : "word",
> >     "position" : 1
> >   }, {
> >     "token" : "appId=cs",
> >     "start_offset" : 24,
> >     "end_offset" : 32,
> >     "type" : "word",
> >     "position" : 2
> >   }, {
> >     "token" : "Times=Me:22/Total:22",
> >     "start_offset" : 33,
> >     "end_offset" : 53,
> >     "type" : "word",
> >     "position" : 3
> >   }, {
> >     "token" : "(updated",
> >     "start_offset" : 54,
> >     "end_offset" : 62,
> >     "type" : "word",
> >     "position" : 4
> >   }, {
> >     "token" : "attributes=gps_lng:",
> >     "start_offset" : 63,
> >     "end_offset" : 82,
> >     "type" : "word",
> >     "position" : 5
> >   }, {
> >     "token" : "183731222/",
> >     "start_offset" : 83,
> >     "end_offset" : 93,
> >     "type" : "word",
> >     "position" : 6
> >   }, {
> >     "token" : "gps_lat:",
> >     "start_offset" : 94,
> >     "end_offset" : 102,
> >     "type" : "word",
> >     "position" : 7
> >   }, {
> >     "token" : "289309222/",
> >     "start_offset" : 103,
> >     "end_offset" : 113,
> >     "type" : "word",
> >     "position" : 8
> >   }, {
> >     "token" : ")",
> >     "start_offset" : 114,
> >     "end_offset" : 115,
> >     "type" : "word",
> >     "position" : 9
> >   } ]
> > }
> > Which is the output of a standard analyzer.
> >
> > Giving the tokenizer and filters in the analyze API directly works fine:
> > curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,word_delimiter,shingle,asciifolding,standard&pretty=1' -d 'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ )'
> > This results in:
> > {
> >   "tokens" : [ {
> >     "token" : "service",
> >     "start_offset" : 0,
> >     "end_offset" : 7,
> >     "type" : "word",
> >     "position" : 1
> >   }, {
> >     "token" : "service mymdb",
> >     "start_offset" : 0,
> >     "end_offset" : 13,
> >     "type" : "shingle",
> >     "position" : 1
> >   }, {
> >     "token" : "mymdb",
> >     "start_offset" : 8,
> >     "end_offset" : 13,
> >     "type" : "word",
> >     "position" : 2
> >   }, {
> >     "token" : "mymdb onmessage",
> >     "start_offset" : 8,
> >     "end_offset" : 23,
> >     "type" : "shingle",
> >     "position" : 2
> >   }, {
> >     "token" : "onmessage",
> >     "start_offset" : 14,
> >     "end_offset" : 23,
> >     "type" : "word",
> >     "position" : 3
> >   }, {
> >     "token" : "onmessage appid",
> >     "start_offset" : 14,
> >     "end_offset" : 29,
> >     "type" : "shingle",
> >     "position" : 3
> >   }, {
> >     "token" : "appid",
> >     "start_offset" : 24,
> >     "end_offset" : 29,
> >     "type" : "word",
> >     "position" : 4
> >   }, {
> >     "token" : "appid cs",
> >     "start_offset" : 24,
> >     "end_offset" : 32,
> >     "type" : "shingle",
> >     "position" : 4
> >   }, {
> >     "token" : "cs",
> >     "start_offset" : 30,
> >     "end_offset" : 32,
> >     "type" : "word",
> >     "position" : 5
> >   }, {
> >     "token" : "cs times",
> >     "start_offset" : 30,
> >     "end_offset" : 38,
> >     "type" : "shingle",
> >     "position" : 5
> >   }, {
> >     "token" : "times",
> >     "start_offset" : 33,
> >     "end_offset" : ...
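As a side note, a quick way to double-check what the index actually registered is a plain settings lookup (standard API, nothing specific to this setup):

curl -XGET 'localhost:9200/bogstash-1/_settings?pretty=1'

Comparing the index.analysis block in the response against the template should show whether the analyzer, and in particular its token filter list, was picked up as intended.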