Hey, since these two sample messages are very different in nature, it is hard to use something like scripting to cut the messages off after a certain length as a workaround. I would go with some sort of preprocessing (maybe using Logstash), where you give each message a certain type/identifier at index time and then facet on that field instead of the raw message.
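
To make the preprocessing idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not a Logstash config: the function name `message_type`, the placeholder tokens, and the three regular expressions are all assumptions of mine, written to fit the sample messages quoted in this thread. The idea is to replace the variable parts (instance id, UUID in the path, retry counter) with fixed placeholders so that messages of the same kind collapse to one value.

```python
import re
from collections import Counter

# Hypothetical normalization rules, derived only from the sample messages
# in this thread. Real logs will most likely need more (or different) rules.
_RULES = [
    # UUID such as 9b173949-c34d-401e-a214-8e3d8ddefd46
    (re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}", re.IGNORECASE), "<uuid>"),
    # Instance name such as instance-0000bcbb
    (re.compile(r"instance-[0-9a-f]+", re.IGNORECASE), "instance-<id>"),
    # Retry counter such as "Sleeping 60 seconds"
    (re.compile(r"Sleeping \d+ seconds"), "Sleeping <n> seconds"),
]


def message_type(message):
    """Return the message with its variable parts replaced by placeholders."""
    for pattern, placeholder in _RULES:
        message = pattern.sub(placeholder, message)
    return message


# A few of the messages from the aggregation output quoted in this thread.
samples = [
    "Getting disk size of instance-0000bcbb: [Errno 2] No such file or directory: "
    "'/var/lib/nova/instances/9b173949-c34d-401e-a214-8e3d8ddefd46/disk'",
    "Getting disk size of instance-0000bd08: [Errno 2] No such file or directory: "
    "'/var/lib/nova/instances/a4e2c7b5-093a-494f-bdef-5b6997e7c3bb/disk'",
    "Unable to connect to AMQP server: [Errno 113] EHOSTUNREACH. Sleeping 60 seconds",
    "Unable to connect to AMQP server: [Errno 113] EHOSTUNREACH. Sleeping 32 seconds",
    "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds",
    "Unable to connect to AMQP server: [Errno 110] ETIMEDOUT. Sleeping 16 seconds",
]

# Count how many sample messages fall into each normalized type.
type_counts = Counter(message_type(m) for m in samples)

if __name__ == "__main__":
    for msg_type, count in type_counts.most_common():
        print(count, msg_type)
```

If the normalized value is stored in its own non-analyzed field at index time, the same terms aggregation could then be run on that field rather than on `message.raw`. The rules above deliberately leave the errno and error name untouched, so the six samples collapse into the four message types John described.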

--Alex

On Wed, Apr 9, 2014 at 7:34 PM, John Stanford <[email protected]> wrote:

> Here's an example. If I use aggregations to search for the top 10 most
> frequent messages:
>
> POST _search
> {
>   "query": {
>     "match": {
>       "loglevel": "error"
>     }
>   },
>   "aggs": {
>     "freqent_msgs": {
>       "terms": {
>         "field": "message.raw",
>         "size": 10
>       }
>     }
>   }
> }
>
> I end up with a list that exhibit two undesirable characteristics. The
> top 3 entries are the same type of message, but have different instances.
> The remaining messages are a few different types, but each of them has a
> repetitive counter. Is there a way to overlook these differences so the
> result would be closer to the 4 message types?
>
> "aggregations": {
>   "freqent_msgs": {
>     "buckets": [
>       {
>         "key": "Getting disk size of instance-0000bcbb: [Errno 2] No such file or directory: '/var/lib/nova/instances/9b173949-c34d-401e-a214-8e3d8ddefd46/disk'",
>         "doc_count": 22599
>       },
>       {
>         "key": "Getting disk size of instance-0000bd08: [Errno 2] No such file or directory: '/var/lib/nova/instances/a4e2c7b5-093a-494f-bdef-5b6997e7c3bb/disk'",
>         "doc_count": 13447
>       },
>       {
>         "key": "Getting disk size of instance-0000bd09: [Errno 2] No such file or directory: '/var/lib/nova/instances/ca680c42-f7c8-49ea-b46e-8864051c860c/disk'",
>         "doc_count": 13447
>       },
>       {
>         "key": "Unable to connect to AMQP server: [Errno 113] EHOSTUNREACH. Sleeping 60 seconds",
>         "doc_count": 32
>       },
>       {
>         "key": "Unable to connect to AMQP server: [Errno 113] EHOSTUNREACH. Sleeping 32 seconds",
>         "doc_count": 15
>       },
>       {
>         "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds",
>         "doc_count": 12
>       },
>       {
>         "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 4 seconds",
>         "doc_count": 10
>       },
>       {
>         "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 8 seconds",
>         "doc_count": 9
>       },
>       {
>         "key": "Unable to connect to AMQP server: [Errno 110] ETIMEDOUT. Sleeping 16 seconds",
>         "doc_count": 7
>       },
>       {
>         "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 1 seconds",
>         "doc_count": 7
>       }
>     ]
>   }
> }
>
> Thanks,
> John
>
> On Monday, April 7, 2014 4:26:59 PM UTC-7, John Stanford wrote:
>>
>> Hi,
>>
>> I have a bunch of text events indexed as a message field, and in many
>> cases, they are similar but not exactly the same. Is there a way to return
>> the top n most frequently occurring similar phrases, and if so, how would I
>> control the definition of similar?
>>
>> Thanks,
>> John
>>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/803575fb-fae1-43d0-9085-2e7fdc21f321%40googlegroups.com
>
> For more options, visit https://groups.google.com/d/optout.
