Hi there,

We're using ES for web analytics and have loved the experience so far. We create hourly indexes that contain a single document type, "url", which has multiple metric fields such as "page_views". We've recently begun looking into how to store more complex metrics that require set arithmetic, such as "unique views" or "unique visitors".
While the cardinality aggregation <http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html> is awesome, it seems like it would be crazy for us to store all the user IDs we saw on certain URLs, even for a single hour, as the number could grow very large very quickly. Just to clarify, this is the document schema I'm saying would probably be silly:

    {
      "url": "http://example.com/",
      "hour": "2014-05-31T03:00:00",
      "user_ids": [
        "e4c88ac4-ccc7-49e0-9a2e-34ab24420d2b",
        "252d0f6e-2e9d-487d-95f4-ac3d53cce977",
        "90b5d83b-44d6-4462-9f4b-3ab41e75143e",
        "b6c9d0f8-5e4f-4308-92eb-be68d7b06d78",
        "7a097ac1-7410-4918-a780-0020197d0b14"
      ],
      "metrics": {
        "page_views": 100
      }
    }

Being fairly new to Lucene and ES, I don't really know what a massive (> 100K) user_ids array per document would do to ES/Lucene at indexing or query time. In addition, although that structure would let us query for hourly URLs that contain a certain user_id, that is probably beyond our current scope.

Precomputing the unique count per hour doesn't help us either, because we want to aggregate at query time and get the number of unique users across a series of hours; per-hour counts can't simply be added up, since the same user may appear in several hours.

I've been toying with two approaches and wanted to get some feedback:

1. Find a way to store only the HLL object in ES, without the actual array of distinct values. This way we get the benefit of the cardinality aggregation without storing the full set of user_ids. Is there a way to do this?

2. Store a binary blob that represents a custom HLL which we create and index ourselves. Create a new aggregation that performs a bitwise OR on that binary object, which would let us union the HLLs in the aggregation and return the result.

I lean a little more toward solution #2, only because we'd prefer to have the HLL's accuracy tunable rather than rely on ES defaults. Would love to hear some thoughts on how to solve this kind of issue.
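To make approach #2 a bit more concrete, here is a rough Python sketch of the kind of blob I have in mind. It is only an illustration, not ES code: the class name HLL, its methods, and the precision parameter p are made up for this example, and SHA-1 is just a convenient stand-in for a proper hash function. One caveat I noticed while writing it: with one small counter per register, the union of two HLLs is usually the register-wise maximum rather than a literal bitwise OR (OR would be the right merge for a plain bitmap sketch), so that is what merge() does here.

```python
import hashlib
import math


class HLL:
    """Toy HyperLogLog: 2**p one-byte registers, serializable as a blob."""

    def __init__(self, p=12, registers=None):
        self.p = p                      # precision: more registers, less error
        self.m = 1 << p                 # number of registers
        self.registers = (
            bytearray(registers) if registers is not None else bytearray(self.m)
        )

    def add(self, value):
        # Take a 64-bit hash of the value (first 8 bytes of SHA-1).
        h = int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Union of two sketches: keep the larger value of each register pair.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        merged = bytes(max(a, b) for a, b in zip(self.registers, other.registers))
        return HLL(self.p, merged)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias constant for m >= 128
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = m * math.log(m / zeros)
        return round(estimate)

    def to_bytes(self):
        # This is the binary blob that would be stored on the hourly document.
        return bytes(self.registers)


# Two hourly sketches with 500 overlapping (made-up) user IDs.
hour1, hour2 = HLL(), HLL()
for i in range(1000):
    hour1.add(f"user-{i}")
for i in range(500, 1500):
    hour2.add(f"user-{i}")

union = hour1.merge(hour2)
print(union.count())  # an estimate of the 1500 distinct users
```

With p=12 each hourly document would carry a fixed 4 KB blob no matter how many users hit the URL, and raising or lowering p is the accuracy knob I'd like to control.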
Mike
