Re: most frequently occurring phrases?

John Stanford Fri, 11 Apr 2014 07:00:47 -0700

Hi Alex,

Yeah, I'm doing that with some other message types, but was hoping to limit 
that to select messages with metrics in them.  I may look into some 
post-processing strategies, and will keep searching for a reasonable solution 
within Elasticsearch.
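
One post-processing option, as a rough sketch only: mask the variable parts of 
each bucket key and sum the counts of the keys that collapse together. The 
masking rules and the names normalize and merge_buckets below are my own 
assumptions, tuned to the sample messages in this thread, and not anything 
Elasticsearch provides. The buckets argument is assumed to be the "buckets" 
list from the terms aggregation response.

```python
import re
from collections import Counter

# Hypothetical masking rules, tuned to the sample messages in this thread.
# Order matters: UUIDs and instance ids are masked before bare numbers.
MASKS = [
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"), "<uuid>"),
    (re.compile(r"instance-[0-9a-f]{8}"), "instance-<id>"),
    (re.compile(r"\d+"), "<n>"),
]


def normalize(message):
    """Replace the variable parts of a message with fixed placeholders."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message


def merge_buckets(buckets):
    """Sum doc_count over buckets whose keys normalize to the same text.

    Returns (normalized_key, total_count) pairs, most frequent first.
    """
    totals = Counter()
    for bucket in buckets:
        totals[normalize(bucket["key"])] += bucket["doc_count"]
    return totals.most_common()


if __name__ == "__main__":
    sample = [
        {"key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds", "doc_count": 12},
        {"key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 4 seconds", "doc_count": 10},
    ]
    for key, count in merge_buckets(sample):
        print(count, key)
```

The merged counts only cover the buckets the aggregation returned, so the 
request would need a "size" well above 10 for the totals to mean much. The 
same normalize step could also run before indexing to fill a separate 
message-type field, which is closer to what Alex suggested.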


Thanks,

John

> On Apr 10, 2014, at 11:05 PM, Alexander Reelsen <[email protected]> wrote:
> 
> Hey,
> 
> As these two sample messages are very different in nature, it is hard to use 
> something like scripting to cut those messages off after a certain length as 
> a workaround. I would go with some sort of preprocessing (maybe using 
> Logstash), where you give each message a certain type/identifier and facet on 
> that one.
> 
> 
> --Alex
> 
> 
>> On Wed, Apr 9, 2014 at 7:34 PM, John Stanford <[email protected]> wrote:
>> Here's an example.  If I use aggregations to search for the top 10 most 
>> frequent messages:
>> 
>> POST _search
>> {
>>   "query": {
>>     "match": {
>>       "loglevel": "error"
>>     }
>>   },
>>   "aggs": {
>>     "freqent_msgs": {
>>       "terms": {
>>         "field": "message.raw",
>>         "size": 10
>>       }
>>     }
>>   }
>> }
>> 
>> 
>> I end up with a list that exhibits two undesirable characteristics.  The top 
>> 3 entries are the same type of message, but refer to different instances.  
>> The remaining messages are a few different types, but each of them has a 
>> changing counter.  Is there a way to overlook these differences so the 
>> result would be closer to the 4 message types?
>> 
>>    "aggregations": {
>>       "freqent_msgs": {
>>          "buckets": [
>>             {
>>                "key": "Getting disk size of instance-0000bcbb: [Errno 2] No such file or directory: '/var/lib/nova/instances/9b173949-c34d-401e-a214-8e3d8ddefd46/disk'",
>>                "doc_count": 22599
>>             },
>>             {
>>                "key": "Getting disk size of instance-0000bd08: [Errno 2] No such file or directory: '/var/lib/nova/instances/a4e2c7b5-093a-494f-bdef-5b6997e7c3bb/disk'",
>>                "doc_count": 13447
>>             },
>>             {
>>                "key": "Getting disk size of instance-0000bd09: [Errno 2] No such file or directory: '/var/lib/nova/instances/ca680c42-f7c8-49ea-b46e-8864051c860c/disk'",
>>                "doc_count": 13447
>>             },
>>             {
>>                "key": "Unable to connect to AMQP server: [Errno 113] EHOSTUNREACH. Sleeping 60 seconds",
>>                "doc_count": 32
>>             },
>>             {
>>                "key": "Unable to connect to AMQP server: [Errno 113] EHOSTUNREACH. Sleeping 32 seconds",
>>                "doc_count": 15
>>             },
>>             {
>>                "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds",
>>                "doc_count": 12
>>             },
>>             {
>>                "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 4 seconds",
>>                "doc_count": 10
>>             },
>>             {
>>                "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 8 seconds",
>>                "doc_count": 9
>>             },
>>             {
>>                "key": "Unable to connect to AMQP server: [Errno 110] ETIMEDOUT. Sleeping 16 seconds",
>>                "doc_count": 7
>>             },
>>             {
>>                "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 1 seconds",
>>                "doc_count": 7
>>             }
>>          ]
>>       }
>>    }
>> 
>> Thanks,
>> John
>> 
>>> On Monday, April 7, 2014 4:26:59 PM UTC-7, John Stanford wrote:
>>> Hi,
>>> 
>>> I have a bunch of text events indexed as a message field, and in many 
>>> cases, they are similar but not exactly the same.  Is there a way to return 
>>> the top n most frequently occurring similar phrases, and if so, how would I 
>>> control the definition of similar?
>>> 
>>> Thanks,
>>> John
>> 
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/803575fb-fae1-43d0-9085-2e7fdc21f321%40googlegroups.com.
>> 
>> For more options, visit https://groups.google.com/d/optout.

