That output does not look like something generated by the standard analyzer, since it contains uppercase letters and various non-word characters such as '='.
Your two analysis requests will differ since the second one contains the default word_delimiter filter instead of your custom my_word_delimiter. What you are trying to achieve is somewhat difficult, but you can get there if you keep on tweaking. :) Try using a pattern tokenizer instead of the whitespace tokenizer if you want more control over word boundaries.

--
Ivan

On Fri, Aug 29, 2014 at 1:48 AM, Marc <mn.off...@googlemail.com> wrote:
> Hi Ivan,
>
> thanks again. I have tried that and found a reasonable combination.
> Nevertheless, when I now try to use the analyze API with an index that has
> the said analyzer defined via template, it doesn't seem to be applied.
>
> This is the complete template:
>
> {
>   "template": "bogstash-*",
>   "settings": {
>     "index.number_of_replicas": 0,
>     "analysis": {
>       "analyzer": {
>         "msg_excp_analyzer": {
>           "type": "custom",
>           "tokenizer": "whitespace",
>           "filters": ["word_delimiter",
>                       "lowercase",
>                       "asciifolding",
>                       "shingle",
>                       "standard"]
>         }
>       },
>       "filters": {
>         "my_word_delimiter": {
>           "type": "word_delimiter",
>           "preserve_original": "true"
>         },
>         "my_asciifolding": {
>           "type": "asciifolding",
>           "preserve_original": true
>         }
>       }
>     }
>   },
>   "mappings": {
>     "_default_": {
>       "properties": {
>         "@excp": {
>           "type": "string",
>           "index": "analyzed",
>           "analyzer": "msg_excp_analyzer"
>         },
>         "@msg": {
>           "type": "string",
>           "index": "analyzed",
>           "analyzer": "msg_excp_analyzer"
>         }
>       }
>     }
>   }
> }
>
> I create the index bogstash-1.
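[A possible fix, offered as a sketch to verify against the analysis-settings documentation for your Elasticsearch version: the template above uses the plural key "filters" both for the analyzer's filter chain and for the filter-definition block, where the documented key is the singular "filter", and the analyzer lists the built-in word_delimiter rather than the custom my_word_delimiter that Ivan mentions. A corrected analysis block, using only the names already present in the template, might look like this:]

```json
"analysis": {
  "analyzer": {
    "msg_excp_analyzer": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filter": ["my_word_delimiter",
                 "lowercase",
                 "my_asciifolding",
                 "shingle"]
    }
  },
  "filter": {
    "my_word_delimiter": {
      "type": "word_delimiter",
      "preserve_original": true
    },
    "my_asciifolding": {
      "type": "asciifolding",
      "preserve_original": true
    }
  }
}
```

[If the unrecognized "filters" key is silently ignored by the server, the custom filters would never be registered, which would be consistent with the analyzer not taking effect.]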
> Now I test the following:
>
> curl -XGET 'localhost:9200/bogstash-1/_analyze?analyzer=msg_excp_analyzer&pretty=1' \
>   -d 'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ )'
>
> and it returns:
>
> {
>   "tokens" : [
>     { "token" : "Service=MyMDB.onMessage", "start_offset" : 0,   "end_offset" : 23,  "type" : "word", "position" : 1 },
>     { "token" : "appId=cs",                "start_offset" : 24,  "end_offset" : 32,  "type" : "word", "position" : 2 },
>     { "token" : "Times=Me:22/Total:22",    "start_offset" : 33,  "end_offset" : 53,  "type" : "word", "position" : 3 },
>     { "token" : "(updated",                "start_offset" : 54,  "end_offset" : 62,  "type" : "word", "position" : 4 },
>     { "token" : "attributes=gps_lng:",     "start_offset" : 63,  "end_offset" : 82,  "type" : "word", "position" : 5 },
>     { "token" : "183731222/",              "start_offset" : 83,  "end_offset" : 93,  "type" : "word", "position" : 6 },
>     { "token" : "gps_lat:",                "start_offset" : 94,  "end_offset" : 102, "type" : "word", "position" : 7 },
>     { "token" : "289309222/",              "start_offset" : 103, "end_offset" : 113, "type" : "word", "position" : 8 },
>     { "token" : ")",                       "start_offset" : 114, "end_offset" : 115, "type" : "word", "position" : 9 }
>   ]
> }
>
> Which is the output of a standard analyzer.
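[A diagnostic sketch for narrowing this down, assuming the ES 1.x-style endpoints used elsewhere in this thread: read the settings back from the index. If the "analysis" block is missing from the response, the template was never applied, which would explain the fallback behavior of _analyze.]

```shell
# Inspect what settings bogstash-1 actually ended up with; if no
# "analysis" section appears, the template did not take effect.
curl -XGET 'localhost:9200/bogstash-1/_settings?pretty=1'
```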
> Giving the tokenizer and filters in the analyze API directly works fine:
>
> curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,word_delimiter,shingle,asciifolding,standard&pretty=1' \
>   -d 'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ )'
>
> This results in:
>
> {
>   "tokens" : [
>     { "token" : "service",            "start_offset" : 0,   "end_offset" : 7,   "type" : "word",    "position" : 1 },
>     { "token" : "service mymdb",      "start_offset" : 0,   "end_offset" : 13,  "type" : "shingle", "position" : 1 },
>     { "token" : "mymdb",              "start_offset" : 8,   "end_offset" : 13,  "type" : "word",    "position" : 2 },
>     { "token" : "mymdb onmessage",    "start_offset" : 8,   "end_offset" : 23,  "type" : "shingle", "position" : 2 },
>     { "token" : "onmessage",          "start_offset" : 14,  "end_offset" : 23,  "type" : "word",    "position" : 3 },
>     { "token" : "onmessage appid",    "start_offset" : 14,  "end_offset" : 29,  "type" : "shingle", "position" : 3 },
>     { "token" : "appid",              "start_offset" : 24,  "end_offset" : 29,  "type" : "word",    "position" : 4 },
>     { "token" : "appid cs",           "start_offset" : 24,  "end_offset" : 32,  "type" : "shingle", "position" : 4 },
>     { "token" : "cs",                 "start_offset" : 30,  "end_offset" : 32,  "type" : "word",    "position" : 5 },
>     { "token" : "cs times",           "start_offset" : 30,  "end_offset" : 38,  "type" : "shingle", "position" : 5 },
>     { "token" : "times",              "start_offset" : 33,  "end_offset" : 38,  "type" : "word",    "position" : 6 },
>     { "token" : "times me",           "start_offset" : 33,  "end_offset" : 41,  "type" : "shingle", "position" : 6 },
>     { "token" : "me",                 "start_offset" : 39,  "end_offset" : 41,  "type" : "word",    "position" : 7 },
>     { "token" : "me 22",              "start_offset" : 39,  "end_offset" : 44,  "type" : "shingle", "position" : 7 },
>     { "token" : "22",                 "start_offset" : 42,  "end_offset" : 44,  "type" : "word",    "position" : 8 },
>     { "token" : "22 total",           "start_offset" : 42,  "end_offset" : 50,  "type" : "shingle", "position" : 8 },
>     { "token" : "total",              "start_offset" : 45,  "end_offset" : 50,  "type" : "word",    "position" : 9 },
>     { "token" : "total 22",           "start_offset" : 45,  "end_offset" : 53,  "type" : "shingle", "position" : 9 },
>     { "token" : "22",                 "start_offset" : 51,  "end_offset" : 53,  "type" : "word",    "position" : 10 },
>     { "token" : "22 updated",         "start_offset" : 51,  "end_offset" : 62,  "type" : "shingle", "position" : 10 },
>     { "token" : "updated",            "start_offset" : 55,  "end_offset" : 62,  "type" : "word",    "position" : 11 },
>     { "token" : "updated attributes", "start_offset" : 55,  "end_offset" : 73,  "type" : "shingle", "position" : 11 },
>     { "token" : "attributes",         "start_offset" : 63,  "end_offset" : 73,  "type" : "word",    "position" : 12 },
>     { "token" : "attributes gps",     "start_offset" : 63,  "end_offset" : 77,  "type" : "shingle", "position" : 12 },
>     { "token" : "gps",                "start_offset" : 74,  "end_offset" : 77,  "type" : "word",    "position" : 13 },
>     { "token" : "gps lng",            "start_offset" : 74,  "end_offset" : 81,  "type" : "shingle", "position" : 13 },
>     { "token" : "lng",                "start_offset" : 78,  "end_offset" : 81,  "type" : "word",    "position" : 14 },
>     { "token" : "lng 183731222",      "start_offset" : 78,  "end_offset" : 92,  "type" : "shingle", "position" : 14 },
>     { "token" : "183731222",          "start_offset" : 83,  "end_offset" : 92,  "type" : "word",    "position" : 15 },
>     { "token" : "183731222 gps",      "start_offset" : 83,  "end_offset" : 97,  "type" : "shingle", "position" : 15 },
>     { "token" : "gps",                "start_offset" : 94,  "end_offset" : 97,  "type" : "word",    "position" : 16 },
>     { "token" : "gps lat",            "start_offset" : 94,  "end_offset" : 101, "type" : "shingle", "position" : 16 },
>     { "token" : "lat",                "start_offset" : 98,  "end_offset" : 101, "type" : "word",    "position" : 17 },
>     { "token" : "lat 289309222",      "start_offset" : 98,  "end_offset" : 112, "type" : "shingle", "position" : 17 },
>     { "token" : "289309222",          "start_offset" : 103, "end_offset" : 112, "type" : "word",    "position" : 18 }
>   ]
> }
>
> So it seems the template is not used?! Any obvious reasons/mistakes?
>
> Thx,
> Marc
>
> On Thursday, August 28, 2014 6:17:08 PM UTC+2, Ivan Brusic wrote:
>>
>> Use the Analyze API to view what tokens are being generated. Keep it
>> simple at first (maybe remove shingles) and build up as you encounter more
>> edge cases. What kind of query are you using?
>>
>> --
>> Ivan
>>
>> On Thu, Aug 28, 2014 at 2:05 AM, Marc <mn.o...@googlemail.com> wrote:
>>
>>> Hi Ivan,
>>>
>>> thanks for the help. Now it almost works... ;)
>>> I have used the following:
>>>
>>> "analysis": {
>>>   "analyzer": {
>>>     "msg_excp_analyzer": {
>>>       "type": "custom",
>>>       "tokenizer": "whitespace",
>>>       "filters": ["split-up",
>>>                   "lowercase",
>>>                   "shingle",
>>>                   "ascii-folding"]
>>>     }
>>>   },
>>>   "filter": {
>>>     "split-up": {
>>>       "type": "word_delimiter",
>>>       "preserve_original": "true",
>>>       "catenate_all": "true",
>>>       "type_table": {
>>>         "$": "DIGIT",
>>>         "%": "DIGIT",
>>>         ".": "DIGIT",
>>>         ",": "DIGIT",
>>>         ":": "DIGIT",
>>>         "/": "DIGIT",
>>>         "\\": "DIGIT",
>>>         "=": "DIGIT",
>>>         "&": "DIGIT",
>>>         "(": "DIGIT",
>>>         ")": "DIGIT",
>>>         "<": "DIGIT",
>>>         ">": "DIGIT",
>>>         "\\U+000A": "DIGIT"
>>>       }
>>>     },
>>>     "ascii-folding": {
>>>       "type": "asciifolding",
>>>       "preserve_original": true
>>>     }
>>>   }
>>>
>>> If the above is wrong or not reasonable, please feel free to criticize!
>>>
>>> Now the only thing that does not work is searching for subwords of
>>> concatenations with ".".
>>> Having the log line Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22
>>> (updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ ), I cannot
>>> search for MyMDB or onMessage; only MyMDB.onMessage will work.
>>>
>>> Any more ideas?
>>>
>>> Cheers,
>>> Marc
>>>
>>> On Wednesday, August 27, 2014 9:20:49 AM UTC+2, Ivan Brusic wrote:
>>>
>>>> Off the top of my head, I would use a custom analyzer with a whitespace
>>>> tokenizer and a word delimiter filter (preserving the original tokens as
>>>> well), perhaps with a shingle filter to create bigrams. Or better yet, a
>>>> pattern tokenizer that splits on spaces and parentheses.
>>>>
>>>> Cheers,
>>>>
>>>> Ivan
>>>>
>>>> On Tue, Aug 26, 2014 at 11:57 PM, Marc <mn.o...@googlemail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have quite a simple scenario that has already given me a headache for
>>>>> quite a while.
>>>>> I have one field which is quite big and full of special characters
>>>>> like (, ), =, :, ", ' as well as digits and text.
>>>>> Example:
>>>>> "msg" : "Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ )"
>>>>> I essentially want to be able to search these things using text,
>>>>> wildcards, etc.
>>>>> So far I have tried not analyzing the content and using the wildcard
>>>>> search, and it doesn't work very well.
>>>>> Using different tokenizers and the query_string query also only works
>>>>> to a certain degree.
>>>>> For example, I want to be able to search for the following expressions:
>>>>> Service
>>>>> MyMDB
>>>>> onMessage
>>>>> MyMDB.onMessage
>>>>> appId=cs AND Times=Me:22
>>>>> and other possible permutations.
>>>>> What is a correct setup?! I simply can't find a solution...
>>>>>
>>>>> PS: the data is imported to Elasticsearch using logstash. We access
>>>>> the data using the Java API (all software latest versions).
>>>>>
>>>>> Cheers,
>>>>> Marc
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to elasticsearc...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/elasticsearch/ada9c759-41e0-46ad-9941-3a0f2fb7c122%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
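[To reason about which sub-tokens the chain discussed in this thread (whitespace tokenizer, word_delimiter with preserve_original, lowercase) would make searchable, here is a very rough Python approximation. It is a sketch of the idea only, not of Lucene's actual filter: it ignores case-transition splits, catenation, and the type_table customizations.]

```python
import re

def approximate_analyze(text):
    """Rough stand-in for whitespace tokenizer + word_delimiter
    (preserve_original) + lowercase. Each whitespace token is kept
    whole and also split on non-alphanumeric characters, so
    sub-words like 'mymdb' and 'onmessage' become searchable."""
    tokens = []
    for word in text.split():
        original = word.lower()
        tokens.append(original)  # preserve_original keeps the full token
        parts = [p for p in re.split(r"[^a-z0-9]+", original) if p]
        if parts != [original]:
            tokens.extend(parts)  # the delimiter-split sub-tokens
    return tokens

log = "Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22"
print(approximate_analyze(log))
```

[Under this approximation, a search term such as MyMDB would be lowercased to mymdb at query time and match the mymdb sub-token, which is the behavior Marc is after.]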