I'm continuing my battle with series cardinality issues while still trying to maintain long-term summarized data through continuous queries.
Here's my latest approach, why it doesn't work, and what I'm considering next.

Background: I'm trying to get counts of a highly variable tag that contains a domain name. It needs to be a tag so it can be "grouped by" along with other tags.

Stage 0: At the ingestion level, I now split data into multiple measurements (via multiple UDP listeners): some with low cardinality so the 'raw' data can be kept forever, and some for various other processing, e.g. 'domains'.
Stage 1: The 'domains' measurement comes in with many tags -- high cardinality, but a 2 hour retention policy.
Stage 2: CQ_a: count of Stage 1 data, grouped by the tags of interest and time(1h); 2 hour retention policy.
Stage 3: CQ_b: sum of Stage 2 data, further reducing tags (and cardinality) with a WHERE constraint; time(1h), 2 day retention policy.
Stage 4: CQ_c: sum of Stage 3 data, same tags as Stage 3, now with time(1d); 2 day retention policy.
Stage 5: CQ_d: top(100, domain), selecting many tags to be stored as fields; time(1d), forever retention policy.

Everything works great until Stage 5. Because it produces one-day summaries, and I've reduced the data to only one tag, which is not 'domain', the series is no longer unique enough to accommodate more than one data point per timestamp. Because it's written by a continuous query, it groups on time(1d) and selects into the new measurement with only a single daily timestamp value. As a result, I go from 3500 records to 64 (one per value of the single remaining tag), which is not helpful.

The normal solutions to this problem would be:

1) Add the domain tag back -- not feasible for me due to its high cardinality over time.
2) Increment the timestamp -- this is not supported by integrated CQs, and I don't see that it is easily supported by Kapacitor. Mentioned here <https://groups.google.com/d/topic/influxdb/FFMmfTJ2pGg/discussion> and here <https://github.com/influxdata/influxdb/issues/4614>. I think there was a feature request related to this, but it was shot down; I can't find it again, though.

I see two possible solutions:

i) Use InfluxDB to ingest the data and do some summing, then put the results into PostgreSQL for long-term retrieval, which should be fine once sufficiently summarized. Maybe use www_fdw <https://github.com/cyga/www_fdw/wiki/Documentation> in PostgreSQL to query InfluxDB, or otherwise just an outside script that queries InfluxDB and inserts into PostgreSQL (rough sketch below).

ii) Query the data from the top() command with a script, emulating the actions of the continuous query and re-inserting the results, but modify the daily timestamp so that each item returned from top(100, domain) gets an incremental timestamp ending in 1, 2, 3, ... (rough sketch below).

I'm not sure whether this is in any way a 'normal' use of InfluxDB, whether there's a feature request in here, or whether it would simply be solved by the long-term plan to not require all tags to be in RAM, which would reduce (or eliminate) the whole cardinality constraint.
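In case it helps, here is a rough sketch of what I mean by solution (i), using the influxdb-python and psycopg2 clients. The database, measurement, field, and table names ('telegraf', 'domains_1d', 'count_sum', 'domain_daily') are placeholders, not my real schema:

from influxdb import InfluxDBClient   # influxdb-python 1.x client
import psycopg2

# Solution (i), sketched: a small external script that reads the daily
# per-domain sums out of InfluxDB and inserts them into PostgreSQL for
# long-term retrieval. All names below are placeholders.
influx = InfluxDBClient(host='localhost', port=8086, database='telegraf')
pg = psycopg2.connect("dbname=metrics user=metrics")

# Yesterday's per-domain sums from the short-retention Stage 4 rollup.
result = influx.query(
    "SELECT SUM(count_sum) AS hits FROM domains_1d "
    "WHERE time >= now() - 1d GROUP BY domain"
)

with pg, pg.cursor() as cur:
    # ResultSet.items() yields one ((measurement, tags), points) pair
    # per GROUP BY series, i.e. one per domain here.
    for (_, tags), points in result.items():
        for point in points:
            cur.execute(
                "INSERT INTO domain_daily (day, domain, hits) "
                "VALUES (%s, %s, %s)",
                (point['time'][:10], tags['domain'], point['hits']),
            )

pg.close()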

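And a rough sketch of solution (ii): a cron-style script that emulates the Stage 5 CQ, but nudges each returned domain onto its own timestamp so the points don't all collapse into a single time(1d) bucket. Again, the measurement and field names ('domains_1d', 'domains_top_daily', 'count_sum') are placeholders:

from datetime import datetime, timedelta, timezone
from influxdb import InfluxDBClient

# Solution (ii), sketched: run the top-100-domains query myself, then
# re-insert the results with the daily timestamp bumped by a few
# microseconds per row, so each domain keeps its own point even though
# 'domain' is stored as a field (not a tag) in the forever RP.
client = InfluxDBClient(host='localhost', port=8086, database='telegraf')

# TOP(field, tag, N) returns at most one point per distinct tag value,
# so this yields yesterday's 100 busiest domains from the Stage 4 rollup.
result = client.query(
    "SELECT TOP(count_sum, domain, 100) FROM domains_1d "
    "WHERE time >= now() - 1d"
)

# Re-anchor everything on yesterday's midnight, then offset each row by
# one microsecond so the points no longer share a timestamp.
day_start = datetime.now(timezone.utc).replace(
    hour=0, minute=0, second=0, microsecond=0) - timedelta(days=1)

points = []
for offset, row in enumerate(result.get_points(), start=1):
    points.append({
        "measurement": "domains_top_daily",
        "time": day_start + timedelta(microseconds=offset),
        # domain goes in as a field, keeping series cardinality flat.
        "fields": {
            "domain": row["domain"],
            "count_sum": row["top"],   # TOP()'s value column is named 'top'
        },
    })

client.write_points(points)

This is basically the CQ logic moved into an external scheduler; whether abusing timestamps like that counts as 'normal' InfluxDB usage is part of what I'm asking.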