That output does not look like something generated by the standard
analyzer, since it contains uppercase letters and various non-word
characters such as '='.

Your two analysis requests will differ since the second one contains the
default word_delimiter filter instead of your custom my_word_delimiter.
What you are trying to achieve is somewhat difficult, but you can get there
if you keep on tweaking. :) Try using a pattern tokenizer instead of the
whitespace tokenizer if you want more control over word boundaries.
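An untested sketch along those lines (the tokenizer name is a placeholder,
and you will want to tune the regex; this one splits on runs of whitespace
and parentheses):

"analysis": {
    "tokenizer": {
        "boundary_tokenizer": {
            "type": "pattern",
            "pattern": "[\\s()]+"
        }
    },
    "analyzer": {
        "msg_excp_analyzer": {
            "type": "custom",
            "tokenizer": "boundary_tokenizer",
            "filter": ["my_word_delimiter", "lowercase", "asciifolding", "shingle"]
        }
    },
    "filter": {
        "my_word_delimiter": {
            "type": "word_delimiter",
            "preserve_original": true
        }
    }
}

Also note that the analysis settings use "filter" (singular), not
"filters", both when defining custom token filters and when listing them in
a custom analyzer. As far as I know an unrecognized "filters" key is
silently ignored, which would explain why your template analyzer returned
nothing but raw whitespace tokens. (I also left out the trailing "standard"
filter; it shouldn't be needed here.)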

-- 
Ivan


On Fri, Aug 29, 2014 at 1:48 AM, Marc <mn.off...@googlemail.com> wrote:

> Hi Ivan,
>
> thanks again. I have tried that and found a reasonable combination.
> Nevertheless, when I now use the analyze API against an index that has
> this analyzer defined via a template, it doesn't seem to be applied:
>
> This is the complete template:
> {
>     "template": "bogstash-*",
>     "settings": {
>         "index.number_of_replicas": 0,
>         "analysis": {
>             "analyzer": {
>                 "msg_excp_analyzer": {
>                     "type": "custom",
>                     "tokenizer": "whitespace",
>                     "filters": ["word_delimiter",
>                     "lowercase",
>                     "asciifolding",
>                     "shingle",
>                     "standard"]
>                 }
>             },
>             "filters": {
>                 "my_word_delimiter": {
>                     "type": "word_delimiter",
>                     "preserve_original": "true"
>                 },
>                 "my_asciifolding": {
>                     "type": "asciifolding",
>                     "preserve_original": true
>                 }
>             }
>         }
>     },
>     "mappings": {
>         "_default_": {
>             "properties": {
>                 "@excp": {
>                     "type": "string",
>                     "index": "analyzed",
>                     "analyzer": "msg_excp_analyzer"
>                 },
>                 "@msg": {
>                     "type": "string",
>                     "index": "analyzed",
>                     "analyzer": "msg_excp_analyzer"
>                 }
>             }
>         }
>     }
> }
> I create the index bogstash-1.
> Now I test the following:
> curl -XGET
> 'localhost:9200/bogstash-1/_analyze?analyzer=msg_excp_analyzer&pretty=1' -d
> 'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated
> attributes=gps_lng: 183731222/ gps_lat: 289309222/ )'
> and it returns:
> {
>   "tokens" : [ {
>     "token" : "Service=MyMDB.onMessage",
>     "start_offset" : 0,
>     "end_offset" : 23,
>     "type" : "word",
>     "position" : 1
>   }, {
>     "token" : "appId=cs",
>     "start_offset" : 24,
>     "end_offset" : 32,
>     "type" : "word",
>     "position" : 2
>   }, {
>     "token" : "Times=Me:22/Total:22",
>     "start_offset" : 33,
>     "end_offset" : 53,
>     "type" : "word",
>     "position" : 3
>   }, {
>     "token" : "(updated",
>     "start_offset" : 54,
>     "end_offset" : 62,
>     "type" : "word",
>     "position" : 4
>   }, {
>     "token" : "attributes=gps_lng:",
>     "start_offset" : 63,
>     "end_offset" : 82,
>     "type" : "word",
>     "position" : 5
>   }, {
>     "token" : "183731222/",
>     "start_offset" : 83,
>     "end_offset" : 93,
>     "type" : "word",
>     "position" : 6
>   }, {
>     "token" : "gps_lat:",
>     "start_offset" : 94,
>     "end_offset" : 102,
>     "type" : "word",
>     "position" : 7
>   }, {
>     "token" : "289309222/",
>     "start_offset" : 103,
>     "end_offset" : 113,
>     "type" : "word",
>     "position" : 8
>   }, {
>     "token" : ")",
>     "start_offset" : 114,
>     "end_offset" : 115,
>     "type" : "word",
>     "position" : 9
>   } ]
> }
> This is the output of a standard analyzer.
> Specifying the tokenizer and filters directly in the analyze API works fine:
> curl -XGET
> 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,word_delimiter,shingle,asciifolding,standard&pretty=1'
> -d 'Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22 (updated
> attributes=gps_lng: 183731222/ gps_lat: 289309222/ )'
> This results in:
> {
>   "tokens" : [ {
>     "token" : "service",
>     "start_offset" : 0,
>     "end_offset" : 7,
>     "type" : "word",
>     "position" : 1
>   }, {
>     "token" : "service mymdb",
>     "start_offset" : 0,
>     "end_offset" : 13,
>     "type" : "shingle",
>     "position" : 1
>   }, {
>     "token" : "mymdb",
>     "start_offset" : 8,
>     "end_offset" : 13,
>     "type" : "word",
>     "position" : 2
>   }, {
>     "token" : "mymdb onmessage",
>     "start_offset" : 8,
>     "end_offset" : 23,
>     "type" : "shingle",
>     "position" : 2
>   }, {
>     "token" : "onmessage",
>     "start_offset" : 14,
>     "end_offset" : 23,
>     "type" : "word",
>     "position" : 3
>   }, {
>     "token" : "onmessage appid",
>     "start_offset" : 14,
>     "end_offset" : 29,
>     "type" : "shingle",
>     "position" : 3
>   }, {
>     "token" : "appid",
>     "start_offset" : 24,
>     "end_offset" : 29,
>     "type" : "word",
>     "position" : 4
>   }, {
>     "token" : "appid cs",
>     "start_offset" : 24,
>     "end_offset" : 32,
>     "type" : "shingle",
>     "position" : 4
>   }, {
>     "token" : "cs",
>     "start_offset" : 30,
>     "end_offset" : 32,
>     "type" : "word",
>     "position" : 5
>   }, {
>     "token" : "cs times",
>     "start_offset" : 30,
>     "end_offset" : 38,
>     "type" : "shingle",
>     "position" : 5
>   }, {
>     "token" : "times",
>     "start_offset" : 33,
>     "end_offset" : 38,
>     "type" : "word",
>     "position" : 6
>   }, {
>     "token" : "times me",
>     "start_offset" : 33,
>     "end_offset" : 41,
>     "type" : "shingle",
>     "position" : 6
>   }, {
>     "token" : "me",
>     "start_offset" : 39,
>     "end_offset" : 41,
>     "type" : "word",
>     "position" : 7
>   }, {
>     "token" : "me 22",
>     "start_offset" : 39,
>     "end_offset" : 44,
>     "type" : "shingle",
>     "position" : 7
>   }, {
>     "token" : "22",
>     "start_offset" : 42,
>     "end_offset" : 44,
>     "type" : "word",
>     "position" : 8
>   }, {
>     "token" : "22 total",
>     "start_offset" : 42,
>     "end_offset" : 50,
>     "type" : "shingle",
>     "position" : 8
>   }, {
>     "token" : "total",
>     "start_offset" : 45,
>     "end_offset" : 50,
>     "type" : "word",
>     "position" : 9
>   }, {
>     "token" : "total 22",
>     "start_offset" : 45,
>     "end_offset" : 53,
>     "type" : "shingle",
>     "position" : 9
>   }, {
>     "token" : "22",
>     "start_offset" : 51,
>     "end_offset" : 53,
>     "type" : "word",
>     "position" : 10
>   }, {
>     "token" : "22 updated",
>     "start_offset" : 51,
>     "end_offset" : 62,
>     "type" : "shingle",
>     "position" : 10
>   }, {
>     "token" : "updated",
>     "start_offset" : 55,
>     "end_offset" : 62,
>     "type" : "word",
>     "position" : 11
>   }, {
>     "token" : "updated attributes",
>     "start_offset" : 55,
>     "end_offset" : 73,
>     "type" : "shingle",
>     "position" : 11
>   }, {
>     "token" : "attributes",
>     "start_offset" : 63,
>     "end_offset" : 73,
>     "type" : "word",
>     "position" : 12
>   }, {
>     "token" : "attributes gps",
>     "start_offset" : 63,
>     "end_offset" : 77,
>     "type" : "shingle",
>     "position" : 12
>   }, {
>     "token" : "gps",
>     "start_offset" : 74,
>     "end_offset" : 77,
>     "type" : "word",
>     "position" : 13
>   }, {
>     "token" : "gps lng",
>     "start_offset" : 74,
>     "end_offset" : 81,
>     "type" : "shingle",
>     "position" : 13
>   }, {
>     "token" : "lng",
>     "start_offset" : 78,
>     "end_offset" : 81,
>     "type" : "word",
>     "position" : 14
>   }, {
>     "token" : "lng 183731222",
>     "start_offset" : 78,
>     "end_offset" : 92,
>     "type" : "shingle",
>     "position" : 14
>   }, {
>     "token" : "183731222",
>     "start_offset" : 83,
>     "end_offset" : 92,
>     "type" : "word",
>     "position" : 15
>   }, {
>     "token" : "183731222 gps",
>     "start_offset" : 83,
>     "end_offset" : 97,
>     "type" : "shingle",
>     "position" : 15
>   }, {
>     "token" : "gps",
>     "start_offset" : 94,
>     "end_offset" : 97,
>     "type" : "word",
>     "position" : 16
>   }, {
>     "token" : "gps lat",
>     "start_offset" : 94,
>     "end_offset" : 101,
>     "type" : "shingle",
>     "position" : 16
>   }, {
>     "token" : "lat",
>     "start_offset" : 98,
>     "end_offset" : 101,
>     "type" : "word",
>     "position" : 17
>   }, {
>     "token" : "lat 289309222",
>     "start_offset" : 98,
>     "end_offset" : 112,
>     "type" : "shingle",
>     "position" : 17
>   }, {
>     "token" : "289309222",
>     "start_offset" : 103,
>     "end_offset" : 112,
>     "type" : "word",
>     "position" : 18
>   } ]
> }
>
>
> So it seems the template is not being used?! Any obvious reasons or mistakes?
>
> Thx,
> Marc
>
>
>
>
> On Thursday, August 28, 2014 6:17:08 PM UTC+2, Ivan Brusic wrote:
>>
>> Use the Analyze API to view which tokens are being generated. Keep it
>> simple at first (maybe remove the shingles) and build up as you encounter
>> more edge cases. What kind of query are you using?
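>>
>> For the Analyze API, something like this works (the tokenizer/filters
>> parameters here are just an illustration; substitute whatever combination
>> you are testing):
>>
>> curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=word_delimiter,lowercase&pretty=1' -d 'Service=MyMDB.onMessage appId=cs'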
>>
>> --
>> Ivan
>>
>>
>> On Thu, Aug 28, 2014 at 2:05 AM, Marc <mn.o...@googlemail.com> wrote:
>>
>>> Hi Ivan,
>>>
>>> thanks for the help. Now it almost works... ;)
>>> I have used the following:
>>> "analysis": {
>>>             "analyzer": {
>>>                 "msg_excp_analyzer": {
>>>                     "type": "custom",
>>>                     "tokenizer": "whitespace",
>>>                     "filters": ["split-up",
>>>                     "lowercase",
>>>                     "shingle",
>>>                     "ascii-folding"]
>>>                 }
>>>             },
>>>             "filter": {
>>>                 "split-up": {
>>>                     "type": "word_delimiter",
>>>                     "preserve_original": "true",
>>>                     "catenate_all": "true",
>>>                     "type_table": {
>>>                         "$": "DIGIT",
>>>                         "%": "DIGIT",
>>>                         ".": "DIGIT",
>>>                         ",": "DIGIT",
>>>                         ":": "DIGIT",
>>>                         "/": "DIGIT",
>>>                         "\\": "DIGIT",
>>>                         "=": "DIGIT",
>>>                         "&": "DIGIT",
>>>                         "(": "DIGIT",
>>>                         ")": "DIGIT",
>>>                         "<": "DIGIT",
>>>                         ">": "DIGIT",
>>>                         "\\U+000A": "DIGIT"
>>>                     }
>>>                 },
>>>                 "ascii-folding": {
>>>                     "type": "asciifolding",
>>>                     "preserve_original": true
>>>                 }
>>>             }
>>> If the above is wrong or not reasonable, please feel free to criticize!
>>>
>>>
>>> Now the only thing that does not work is searching for subwords of
>>> concatenations with ".".
>>> Given the log line Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22
>>> (updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ ), I cannot
>>> search for MyMDB or onMessage; only MyMDB.onMessage works.
>>>
>>> Any more ideas?
>>>
>>> Cheers,
>>> Marc
>>>
>>>
>>>
>>> On Wednesday, August 27, 2014 9:20:49 AM UTC+2, Ivan Brusic wrote:
>>>
>>>> Off the top of my head, I would use a custom analyzer with a whitespace
>>>> tokenizer and a word delimiter filter (preserving the original tokens as
>>>> well), perhaps with a shingle filter to create bigrams. Or, better yet, a
>>>> pattern tokenizer that splits on spaces and parentheses.
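>>>>
>>>> As a rough, untested sketch (the analyzer and filter names are just
>>>> placeholders):
>>>>
>>>> "analysis": {
>>>>     "analyzer": {
>>>>         "log_analyzer": {
>>>>             "type": "custom",
>>>>             "tokenizer": "whitespace",
>>>>             "filter": ["my_word_delimiter", "lowercase", "shingle"]
>>>>         }
>>>>     },
>>>>     "filter": {
>>>>         "my_word_delimiter": {
>>>>             "type": "word_delimiter",
>>>>             "preserve_original": true
>>>>         }
>>>>     }
>>>> }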
>>>>
>>>> Cheers,
>>>>
>>>> Ivan
>>>>
>>>>
>>>> On Tue, Aug 26, 2014 at 11:57 PM, Marc <mn.o...@googlemail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have quite a simple scenario that has been giving me a headache for
>>>>> quite a while.
>>>>> I have one field which is quite big and full of special characters
>>>>> like (, ), =, :, ", ', digits, and text.
>>>>> Example:
>>>>>  "msg" : "Service=MyMDB.onMessage appId=cs Times=Me:22/Total:22
>>>>> (updated attributes=gps_lng: 183731222/ gps_lat: 289309222/ )"
>>>>> I essentially want to be able to search these things using text,
>>>>> wildcards, etc.
>>>>> So far I have tried not analyzing the content and using wildcard
>>>>> searches, and it doesn't work very well.
>>>>> Using different tokenizers and the query_string query also only works
>>>>> to a certain degree.
>>>>> For example, I want to be able to search for the following expressions:
>>>>> Service
>>>>> MyMDB
>>>>> onMessage
>>>>> MyMDB.onMessage
>>>>> appId=cs AND Times=Me:22
>>>>>
>>>>> and other possible permutations.
>>>>> What is a correct setup?! I simply can't find a solution...
>>>>>
>>>>> PS: the data is imported into Elasticsearch using Logstash. We access
>>>>> the data using the Java API (all software at the latest versions).
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Marc
>>>>>