On Monday, October 31, 2016 at 4:43:40 PM UTC+1, Sean Beckett wrote:
> 
> So every user always has all 100 other dimensions? And those dimensions are 
> 100% independent of each other? See 
> https://docs.influxdata.com/influxdb/v1.0//concepts/glossary/#series-cardinality
>  for more on dependent vs. independent tags.

The tag values are almost completely independent of each other. There are three 
independent tags, one with 8 possible values, one with 6 (for now, in the 
future the number of values for this one might actually increase), and one 
boolean tag. 8*6*2=96. There is a dependency between the user-id and the 
8-value tag: some users have only 3 different values for this tag, some 5, and 
some all eight. Similarly, some users have only a single value for the boolean 
tag, while others have both. So a better per-user estimate might be 5*6*1.5=45.

Unfortunately, I did not realize that the number of measurements also factors 
into the cardinality of the database. We have 7 different measurements, all 
with the same tags but different values. I guess the cardinality is actually 
7*45=315 per user before taking the user-id into account, which makes the issue 
roughly a factor of 3 worse than my original estimate of 96.
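The arithmetic above can be written down as a quick back-of-envelope sketch. 
The helper name and structure are mine, not anything from InfluxDB; the numbers 
are the ones from this thread:

```python
def series_cardinality(n_users, tag_counts, n_measurements=1):
    """Upper-bound series count: users * product of per-tag value counts
    * number of measurements sharing the tag set."""
    total = n_users * n_measurements
    for count in tag_counts:
        total *= count
    return total

# Worst case per user: three fully independent tags (8, 6, and a boolean).
worst_per_user = series_cardinality(1, [8, 6, 2])                 # 96
# Refined estimate with the per-user tag dependence: 5 * 6 * 1.5.
typical_per_user = series_cardinality(1, [5, 6, 1.5])             # 45.0
# Seven measurements share the same tag set, multiplying cardinality.
all_measurements = series_cardinality(1, [5, 6, 1.5], n_measurements=7)  # 315.0
```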

Also, any extension (new tag, new measurement, increase of tag values) could 
potentially kill our project. Not a good place to be.

> It's highly dependent on the string length of your tag keys and values and 
> the shape of the metadata. 

I would not have expected tag key length to be a factor, but I guess this makes 
sense, as InfluxDB is schema-less, so tags can be added later at will.

> E.g 100 measurements of 1 series each will be 
> different from 100 measurements of 1 series each. 

I think you made a typo here somewhere because I read the same phrase twice.

> That makes it basically impossible to calculate, but if you really do need 15 
> million series, that's going to require in the neighborhood of 128-256GB of 
> RAM.

I understand it is difficult to estimate, but roughly 9-18KB per series just 
for the index sounds like a lot. But then again, I am no expert in time series 
databases, so what do I know. I will stick with your rough estimate for my 
feasibility study.
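Just to show where my 9-18KB figure comes from, here is the division spelled 
out (assuming the 128-256GB quote means GiB; this is only my arithmetic, not a 
statement about InfluxDB internals):

```python
# Per-series index overhead implied by "15 million series -> 128-256 GB RAM".
total_series = 15_000_000
ram_low, ram_high = 128 * 2**30, 256 * 2**30  # bytes, reading "GB" as GiB

per_series_low = ram_low / total_series    # ~9.2 KB per series
per_series_high = ram_high / total_series  # ~18.3 KB per series
```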

> Would this high cardinality be less of an issue in a multi-node setup?
> 
> 
> 
> Yes. If you have, for example, 6 data nodes with a replication factor of 2 
> for redundancy, then each node is only handling 1/3 of the total series
> count. 5 million series per node is still very significant, but with proper 
> schema and lots of RAM, it is probably feasible.

That is good news. Of course, the "7 measurements" factor would still require 7 
times the number of servers, or 7 times the RAM per server, which does not 
sound feasible.
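The per-node figure you gave checks out with the sharding arithmetic: every 
series is stored on `replication` nodes, and those copies are spread across all 
data nodes. A tiny sketch (my own helper, just restating your example):

```python
def series_per_node(total_series, data_nodes, replication_factor):
    """Average series count each data node must index, assuming even sharding."""
    return total_series * replication_factor / data_nodes

# 15M series, 6 data nodes, replication factor 2 -> 1/3 of the total per node.
print(series_per_node(15_000_000, 6, 2))  # 5000000.0
```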

>  Are there any plans to mitigate the cardinality issues in such a use case?
>  
> https://github.com/influxdata/influxdb/issues/7151

That is great news ;-)

>  Would the second approach (storing the data twice) actually help, or would 
> it require the same amount of memory (or even more) than the straightforward 
> approach?
> 
> 
> 
> Slightly more than double, would be my guess. The in-RAM index is per 
> InfluxDB instance, not per database or per series. There's no way to break it 
> down. The total series index for all databases must (currently) always live 
> in RAM.

I also deem this good news as I can forget about the ugly approach and focus on 
the straightforward one :-)

In the end, the feasibility of our use case might boil down to a solution for 
issue 7151.
