Here's an example.  If I use aggregations to search for the top 10 most 
frequent messages:

POST _search
{
  "query": {
    "match": {
      "loglevel": "error"
    }
  },
  "aggs": {
    "freqent_msgs": {
      "terms": {
        "field": "message.raw",
        "size": 10
      }
    }
  }
}


I end up with a list that exhibits two undesirable characteristics.  The top 
3 entries are the same type of message, but each refers to a different 
instance.  The remaining entries fall into a few types, but within each type 
they differ only in a changing counter (the sleep interval).  Is there a way 
to overlook these differences so the result would be closer to the 4 message 
types?  

   "aggregations": {
      "freqent_msgs": {
         "buckets": [
            {
               "key": "Getting disk size of instance-0000bcbb: [Errno 2] No such file or directory: '/var/lib/nova/instances/9b173949-c34d-401e-a214-8e3d8ddefd46/disk'",
               "doc_count": 22599
            },
            {
               "key": "Getting disk size of instance-0000bd08: [Errno 2] No such file or directory: '/var/lib/nova/instances/a4e2c7b5-093a-494f-bdef-5b6997e7c3bb/disk'",
               "doc_count": 13447
            },
            {
               "key": "Getting disk size of instance-0000bd09: [Errno 2] No such file or directory: '/var/lib/nova/instances/ca680c42-f7c8-49ea-b46e-8864051c860c/disk'",
               "doc_count": 13447
            },
            {
               "key": "Unable to connect to AMQP server: [Errno 113] EHOSTUNREACH. Sleeping 60 seconds",
               "doc_count": 32
            },
            {
               "key": "Unable to connect to AMQP server: [Errno 113] EHOSTUNREACH. Sleeping 32 seconds",
               "doc_count": 15
            },
            {
               "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds",
               "doc_count": 12
            },
            {
               "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 4 seconds",
               "doc_count": 10
            },
            {
               "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 8 seconds",
               "doc_count": 9
            },
            {
               "key": "Unable to connect to AMQP server: [Errno 110] ETIMEDOUT. Sleeping 16 seconds",
               "doc_count": 7
            },
            {
               "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 1 seconds",
               "doc_count": 7
            }
         ]
      }
   }
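
To make "overlooking the differences" concrete, here is a rough client-side 
sketch in Python of the grouping I am after.  It is only an illustration: the 
regex patterns, the placeholder tokens and the function names are my own 
assumptions based on the sample keys above, not anything Elasticsearch 
provides.  It also only merges the buckets that were returned, so with 
"size": 10 the totals are approximate.

```python
import re
from collections import Counter

# Hypothetical normalization rules, guessed from the sample keys above.
# Each rule replaces a variable part of the message with a fixed placeholder.
RULES = [
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"), "<uuid>"),
    (re.compile(r"instance-[0-9a-f]{8}"), "instance-<id>"),
    (re.compile(r"Sleeping \d+ seconds"), "Sleeping <n> seconds"),
]


def normalize(message):
    """Collapse instance ids, UUIDs and sleep counters into placeholders."""
    for pattern, placeholder in RULES:
        message = pattern.sub(placeholder, message)
    return message


def merge_buckets(buckets):
    """Sum doc_count over buckets whose keys normalize to the same text."""
    totals = Counter()
    for bucket in buckets:
        totals[normalize(bucket["key"])] += bucket["doc_count"]
    return totals.most_common()


DISK = "Getting disk size of instance-%s: [Errno 2] No such file or directory: '/var/lib/nova/instances/%s/disk'"
AMQP = "Unable to connect to AMQP server: [Errno %s. Sleeping %d seconds"

# The ten buckets from the response above, rebuilt as (key, doc_count) pairs.
sample = [
    {"key": key, "doc_count": count}
    for key, count in [
        (DISK % ("0000bcbb", "9b173949-c34d-401e-a214-8e3d8ddefd46"), 22599),
        (DISK % ("0000bd08", "a4e2c7b5-093a-494f-bdef-5b6997e7c3bb"), 13447),
        (DISK % ("0000bd09", "ca680c42-f7c8-49ea-b46e-8864051c860c"), 13447),
        (AMQP % ("113] EHOSTUNREACH", 60), 32),
        (AMQP % ("113] EHOSTUNREACH", 32), 15),
        (AMQP % ("111] ECONNREFUSED", 2), 12),
        (AMQP % ("111] ECONNREFUSED", 4), 10),
        (AMQP % ("111] ECONNREFUSED", 8), 9),
        (AMQP % ("110] ETIMEDOUT", 16), 7),
        (AMQP % ("111] ECONNREFUSED", 1), 7),
    ]
]

for key, count in merge_buckets(sample):
    print(count, key)
```

On the ten buckets above this collapses the list to the 4 message types I 
mentioned.  Ideally the same grouping would happen inside the aggregation 
rather than after the fact.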

Thanks,
John

On Monday, April 7, 2014 4:26:59 PM UTC-7, John Stanford wrote:
>
> Hi,
>
> I have a bunch of text events indexed as a message field, and in many 
> cases, they are similar but not exactly the same.  Is there a way to return 
> the top n most frequently occurring similar phrases, and if so, how would I 
> control the definition of similar?
>
> Thanks,
> John
>
