Hi there,

We're using ES for web analytics purposes and so far, have loved the 
experience.  We create hourly indexes that contain only one type of "url" 
document which has multiple metrics fields like "page_views".  We've 
recently begun looking into how to store more complex metrics that require 
set arithmetic such as "unique views" or "unique visitors".

While the cardinality aggregation
<http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html>
is awesome, it seems like it'd be crazy for us to store all the user IDs that
we saw, even for a single hour, on certain URLs, as the number could grow
very large very quickly.  Just to clarify, this is the document schema I'm
saying would probably be silly:

{
    "url": "http://example.com/",
    "hour": "2014-05-31T03:00:00",
    "user_ids": [
        "e4c88ac4-ccc7-49e0-9a2e-34ab24420d2b",
        "252d0f6e-2e9d-487d-95f4-ac3d53cce977",
        "90b5d83b-44d6-4462-9f4b-3ab41e75143e",
        "b6c9d0f8-5e4f-4308-92eb-be68d7b06d78",
        "7a097ac1-7410-4918-a780-0020197d0b14"
    ],
    "metrics": {
        "page_views": 100
    }
}

Being fairly new to Lucene and ES, I don't really know what a massive
(> 100K) user_ids array per document would do to ES/Lucene at indexing or
query time. In addition, although that structure would allow us to query
for hourly URLs that contained a certain user_id, that capability is
probably beyond our current scope.  Precomputing the unique count per hour
doesn't help us either, because we want to aggregate at query time and know
the unique users across a series of hours, and hourly counts can't simply
be added together.
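
To make that last point concrete, here is a tiny Python sketch with made-up
user IDs (the hours and IDs are purely illustrative, not real data).  Summing
precomputed hourly unique counts double-counts anyone seen in more than one
hour, while a union of the underlying sets does not:

```python
# Hypothetical per-hour sets of user IDs for one URL (illustrative data only).
hourly_users = {
    "2014-05-31T03:00:00": {"u1", "u2", "u3"},
    "2014-05-31T04:00:00": {"u2", "u3", "u4"},
    "2014-05-31T05:00:00": {"u1", "u4", "u5"},
}

# Precomputed per-hour unique counts: cheap to store, but not mergeable.
hourly_uniques = {hour: len(users) for hour, users in hourly_users.items()}
summed = sum(hourly_uniques.values())

# True uniques across the range need the union of the underlying sets.
true_uniques = len(set().union(*hourly_users.values()))

print("sum of hourly uniques:", summed)
print("uniques across all hours:", true_uniques)
```

This is why whatever we store per hour has to be something that can be
unioned, not just a number.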

I'm toying with two approaches in my head, and I wanted to get some
feedback:


   1. Find a way to store only the HLL object in ES but without the actual 
   array of distinct values.  This way, we have the benefit of the cardinality 
   aggregations, but without storing the full set of user_ids.  Is there a way 
   to do this?
   2. Store a binary blob which represents a custom HLL that we'll create 
   and index.  Create a new aggregation that merges those binary objects (a 
   bitwise OR for a bitmap sketch, or a register-wise max for HLL), which 
   would allow us to union the HLLs in the aggregation and return that result.
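
To illustrate what I mean by approach #2, here is a rough Python sketch of a
toy HLL, not a production implementation and not what ES does internally.
The class name TinyHLL, the choice of SHA-1 as the hash, and the parameter p
are all my own assumptions for illustration.  One detail worth noting: for
HLL registers the union is an element-wise max rather than a literal bitwise
OR (a plain OR is what you'd use for a bitmap-style sketch).

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog. `p` sets the register count (2**p) and so the accuracy."""

    def __init__(self, p=12, registers=None):
        self.p = p
        self.m = 1 << p
        # One byte per register; this bytearray is the "binary blob".
        self.registers = (
            bytearray(registers) if registers is not None else bytearray(self.m)
        )

    def add(self, value):
        # Take 64 bits of a SHA-1 digest as the hash (an arbitrary choice).
        h = int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Position of the leftmost 1-bit in `rest`, counting from 1.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Union of two sketches: element-wise max of the registers.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        merged = bytes(max(a, b) for a, b in zip(self.registers, other.registers))
        return TinyHLL(self.p, merged)

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # standard constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return m * math.log(m / zeros)
        return raw

    def to_bytes(self):
        return bytes(self.registers)


# Two hourly sketches with overlapping (made-up) users.
hour_a, hour_b, direct = TinyHLL(), TinyHLL(), TinyHLL()
for i in range(1000):
    hour_a.add("user-%d" % i)
for i in range(500, 1500):
    hour_b.add("user-%d" % i)
for i in range(1500):
    direct.add("user-%d" % i)

merged = hour_a.merge(hour_b)
print("estimated uniques across both hours:", round(merged.estimate()))
```

The useful property is that merging the two hourly blobs gives exactly the
same registers as a sketch built from all the raw IDs, so only the blob
(4 KB here with p=12) would need to be stored per document.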

I lean a little more toward solution #2, only because we'd prefer to have 
the HLL's accuracy tunable instead of relying on ES defaults.

Would love to hear some thoughts on how to solve this kind of issue.

Mike
