Re: [influxdb] Schema design: put user-id in series name or tag

Sean Beckett Mon, 31 Oct 2016 08:43:55 -0700

On Sat, Oct 29, 2016 at 1:18 PM, <[email protected]> wrote:

> I am investigating the use of InfluxDB for storing statistics in our
> project. So far I love the features and ease of use of InfluxDB, but I am
> worried about the details around series cardinality in our use case.
>
> Basically, we have to keep track of a number of statistics per user.
>
> We expect two kinds of queries:
> A. Queries by users limited to their own statistics
> B. System-wide queries for our own benefit, *probably* with no
> conditions on user-ids.
>
> The easiest approach would be to store the user-id as a tag value.
> However, we expect a regular influx (pun intended ;-)) of new users and
> deactivation of old users. Unless we can somehow clean up the existing data
> by removing old users, this would mean the series cardinality would always
> go up, eventually getting us into trouble.
>
> The alternative would be to store all data twice:
> 1. In per-user series for user queries (user-id in the series name)
> 2. In a system-wide series without user-id info for our own system wide
> queries
>
> Not ideal, but it might be workable.
>
> As always, the devil is in the details. We expect a maximum of about 10000
> new users per year, with a maximum of about 50000 active users at any one
> time. The basic cardinality without user-id is about 100.
>
> This means that in ten years the cardinality would grow to about
> (50000+10*10000)*100 = 15 million. This would put it in the category of
> "probably infeasible" in the general hardware guidelines for a single node.
>


So every user always has all 100 other dimensions? And those dimensions are
100% independent of each other? See
https://docs.influxdata.com/influxdb/v1.0//concepts/glossary/#series-cardinality
for more on dependent vs. independent tags.

I suspect your total actual cardinality will be much lower than 15 million.
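
As a rough illustration of why dependent tags matter, here is a minimal
Python sketch. The first function is the worst-case estimate you used
(every tag independent, so the counts multiply); the second counts the
distinct tag sets that actually occur. The tag names and values
(user_id, region, host) are made up for the example and are not taken
from your schema.

```python
def upper_bound_cardinality(tag_value_counts):
    """Worst case: all tags independent, so the value counts multiply."""
    total = 1
    for count in tag_value_counts.values():
        total *= count
    return total


def actual_cardinality(points):
    """Actual series count: number of distinct tag sets present in the data."""
    return len({tuple(sorted(tags.items())) for tags in points})


# The worst-case figure from the question: (50000 + 10 * 10000) users * 100.
worst_case = upper_bound_cardinality({"user_id": 50000 + 10 * 10000, "other": 100})

# Hypothetical small data set: 3 users, 2 regions, 2 hosts.
toy_counts = {"user_id": 3, "region": 2, "host": 2}
toy_upper_bound = upper_bound_cardinality(toy_counts)

# Dependent tags: each user only ever appears with one region/host pair.
toy_points = [
    {"user_id": "u1", "region": "eu", "host": "a"},
    {"user_id": "u2", "region": "eu", "host": "a"},
    {"user_id": "u3", "region": "us", "host": "b"},
    {"user_id": "u1", "region": "eu", "host": "a"},  # same series as the first
]
toy_actual = actual_cardinality(toy_points)

print(worst_case)       # 15000000
print(toy_upper_bound)  # 12
print(toy_actual)       # 3
```

If each user only ever reports a subset of the 100 combinations, the
real number sits somewhere between the two figures.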



> I suspect that the problem with such large cardinality is the memory
> required for the index. Is there any way to estimate what that memory
> requirement would be?
>

It's highly dependent on the string length of your tag keys and values and
the shape of the metadata. E.g., 100 measurements of 1 series each will be
different from 1 measurement of 100 series. That makes it basically
impossible to calculate, but if you really do need 15 million series,
that's going to require in the neighborhood of 128-256GB of RAM.


> Would this high cardinality be less of an issue in a multi-node setup?
>

Yes. If you have, for example, 6 data nodes with a replication factor of 2
for redundancy, then each node is only handling 1/3 of the total series
count. 5 million series per node is still very significant, but with proper
schema and lots of RAM, it is probably feasible.
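
The arithmetic behind that figure, as a small Python sketch. It assumes
series are spread evenly across the data nodes, which is a
simplification; the function name is just for illustration.

```python
def series_per_node(total_series, data_nodes, replication_factor):
    """Average series held per data node, assuming an even distribution."""
    # Every series is stored replication_factor times across the cluster.
    return total_series * replication_factor / data_nodes


per_node = series_per_node(15_000_000, data_nodes=6, replication_factor=2)
print(per_node)  # 5000000.0
```

With 6 nodes and a replication factor of 2, each node carries 2/6 = 1/3
of the 15 million series.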


> Are there any plans to mitigate the cardinality issues in such a use case?
>

https://github.com/influxdata/influxdb/issues/7151


> Would the second approach (storing the data twice) actually help, or would
> it require the same amount of memory (or even more) than the
> straightforward approach?
>

Slightly more than double, would be my guess. The in-RAM index is per
InfluxDB instance, not per database or per series. There's no way to break
it down. The total series index for all databases must (currently) always
live in RAM.


> I would very much appreciate any feedback on these issues as at this point
> in the development of our project it is relatively easy to pick an
> approach. A migration later on would be rather costly.
>
> Regards,
> Pieter.
>
> --
> Remember to include the version number!
> ---
> You received this message because you are subscribed to the Google Groups
> "InfluxData" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/influxdb.
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/influxdb/b5794cef-f4e0-4bbb-ab3b-11ab63fe2a8f%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>



-- 
Sean Beckett
Director of Support and Professional Services
InfluxDB
