On Monday, October 31, 2016 at 4:43:40 PM UTC+1, Sean Beckett wrote:

> So every user always has all 100 other dimensions? And those dimensions
> are 100% independent of each other? See
> https://docs.influxdata.com/influxdb/v1.0//concepts/glossary/#series-cardinality
> for more on dependent vs. independent tags.
The tag values are almost completely independent of each other. There are three independent tags: one with 8 possible values, one with 6 (for now; the number of values for this one might actually increase in the future), and one boolean tag. 8*6*2 = 96. There is a dependency between the user-id and the 8-value tag: some users have only 3 different values for this tag, some 5, and some all 8. Similarly, some users have only a single value for the boolean tag, while others have both. So a better estimate might be 5*6*1.5 = 45.

Unfortunately, I did not realize that the number of measurements also factors into the cardinality of the database. We have 7 different measurements, all with the same tags but different values. I guess the cardinality is actually 7*45 = 315 before taking the user-id into account. That makes the issue roughly a factor of 3 worse. Also, any extension (a new tag, a new measurement, more tag values) could potentially kill our project. Not a good place to be.

> It's highly dependent on the string length of your tag keys and values
> and the shape of the metadata.

I would not have expected tag key length to be a factor, but I guess it makes sense: InfluxDB is schema-less, so tags can be added later at will.

> E.g 100 measurements of 1 series each will be different from 100
> measurements of 1 series each.

I think you made a typo here somewhere, because I read the same phrase twice.

> That makes it basically impossible to calculate, but if you really do
> need 15 million series, that's going to require in the neighborhood of
> 128-256GB of RAM.

I understand it is difficult to estimate, but roughly 9-18 KB per series just for the index sounds like a lot. Then again, I am no expert in time series databases, so what do I know. I will stick with your rough estimate for my feasibility study.

> > Would this high cardinality be less of an issue in a multi-node setup?
>
> Yes.
> If you have, for example, 6 data nodes with a replication factor of 2
> for redundancy, then each node is only handling 1/3 of the total series
> count. 5 million series per node is still very significant, but with
> proper schema and lots of RAM, it is probably feasible.

That is good news. Of course, the "7 measurements" factor would require 7 times as many servers (or 7 times the RAM), which does not sound feasible.

> > Are there any plans to mitigate the cardinality issues in such a use case?
>
> https://github.com/influxdata/influxdb/issues/7151

That is great news ;-)

> > Would the second approach (storing the data twice) actually help, or
> > would it require the same amount of memory (or even more) than the
> > straightforward approach?
>
> Slightly more than double, would be my guess. The in-RAM index is per
> InfluxDB instance, not per database or per series. There's no way to
> break it down. The total series index for all databases must (currently)
> always live in RAM.

I also deem this good news, as I can forget about the ugly approach and focus on the straightforward one :-) In the end, it might boil down to a solution for issue 7151 for our use case to be feasible.
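P.S. For anyone following along, here is a small sketch of the cardinality arithmetic above. The per-user averages (5 of the 8 tag values, 1.5 of the 2 boolean values) are my own rough guesses from our data, not measured numbers:

```python
# Back-of-the-envelope series cardinality estimate for the schema
# discussed above. Cardinality is (number of measurements) times the
# product of the distinct values of each independent tag.

def series_cardinality(measurements, tag_value_counts):
    """Upper bound on series count: measurements x product of tag value counts."""
    total = measurements
    for count in tag_value_counts:
        total *= count
    return total

# Worst case per user: all tag combinations exist (8 * 6 * 2 values).
worst = series_cardinality(7, [8, 6, 2])       # 7 * 96 = 672 per user

# Adjusted for the observed dependency on user-id: on average roughly
# 5 of the 8 values and 1.5 of the 2 boolean values per user (my guess).
adjusted = series_cardinality(7, [5, 6, 1.5])  # 7 * 45 = 315 per user

print(worst, adjusted)
```

Multiply either figure by the number of users to get the database-wide series count.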
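And the per-node arithmetic for the multi-node case, assuming (as in Sean's example) that each series lives on `replication_factor` of the `nodes` data nodes, using his rough 9-18 KB-per-series index figure:

```python
# Rough per-node load in a cluster: with N data nodes and replication
# factor RF, each node holds about total * RF / N of the series.

def series_per_node(total_series, nodes, replication_factor):
    return total_series * replication_factor / nodes

per_node = series_per_node(15_000_000, 6, 2)  # 15M * 2 / 6 = 5M series/node

# Index RAM per node at Sean's rough 9-18 KB per series.
ram_low_gb = per_node * 9 / 1024 / 1024       # ~43 GB
ram_high_gb = per_node * 18 / 1024 / 1024     # ~86 GB

print(per_node, ram_low_gb, ram_high_gb)
```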
