I continue my battle with series cardinality issues while still 
maintaining summarized data long term through continuous queries.

Here's my latest approach, which doesn't work, and why:

Background: I'm trying to get counts of a highly variable tag, which 
contains a domain name. It needs to be a tag so it can be grouped by, 
along with other tags.

Stage 0: At the ingestion level, I now split data into multiple 
measurements (via multiple UDP listeners): some with low cardinality, to 
keep 'raw' data forever, and some for various other processing, e.g. 
'domains'.
Stage 1: The 'domains' measurement comes in with many tags: high 
cardinality, but a 2-hour retention policy.
Stage 2: CQ_a: count of Stage 1 data, grouped by tags of interest and 
time(1h); 2-hour retention policy.
Stage 3: CQ_b: sum of Stage 2 data, further reducing tags and cardinality 
with a WHERE constraint; time(1h), 2-day retention policy.
Stage 4: CQ_c: sum of Stage 3 data, same tags as Stage 3, now with 
time(1d); 2-day retention policy.
Stage 5: CQ_d: top(100, domain), selecting many tags to be stored as 
fields; time(1d), forever retention policy.
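
For concreteness, the early stages look roughly like the sketch below. 
All database, retention policy, tag, and field names here are my own 
placeholders, not the real ones:

```sql
-- Hypothetical retention policies for the short-lived stages
CREATE RETENTION POLICY "rp_2h" ON "telemetry" DURATION 2h REPLICATION 1
CREATE RETENTION POLICY "rp_2d" ON "telemetry" DURATION 2d REPLICATION 1

-- Stage 2 (CQ_a): hourly counts of the raw high-cardinality data
CREATE CONTINUOUS QUERY "cq_a" ON "telemetry" BEGIN
  SELECT count("hits") AS "count_hits"
  INTO "rp_2h"."domains_1h"
  FROM "rp_2h"."domains"
  GROUP BY time(1h), "domain", "node"
END

-- Stage 3 (CQ_b): hourly sums with fewer tags and a WHERE constraint
CREATE CONTINUOUS QUERY "cq_b" ON "telemetry" BEGIN
  SELECT sum("count_hits") AS "sum_hits"
  INTO "rp_2d"."domains_1h_reduced"
  FROM "rp_2h"."domains_1h"
  WHERE "node" = 'edge1'
  GROUP BY time(1h), "domain"
END
```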

Everything works great until Stage 5. Because it produces one-day 
summaries, and I reduced the series to a single tag, which is not 
'domain', the series key is no longer unique enough to accommodate more 
than one data point per timestamp. Because it is written by a continuous 
query, it groups on time(1d) and selects into the new measurement with 
only a single daily timestamp value. As a result, I go from 3500 records 
to 64 (one per value of the single remaining tag), which is not helpful.
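
The collapse follows from InfluxDB's uniqueness rule: points sharing a 
measurement, tag set, and timestamp overwrite each other. A toy 
simulation of that rule (the tag name and timestamp are illustrative; the 
3500-to-64 collapse is the one described above):

```python
# Simulate InfluxDB's point-identity rule: a point is keyed by
# (tag set, timestamp); a later write with the same key overwrites
# the earlier point instead of adding a new one.
def write_points(points):
    """points: iterable of (tagset, timestamp, fields) tuples."""
    store = {}
    for tagset, ts, fields in points:
        store[(tagset, ts)] = fields  # last write wins
    return store

# 3500 rows collapse once 'domain' is dropped and time is truncated
# to a single daily boundary: only the remaining tag distinguishes them.
day = 1_451_606_400_000_000_000  # one daily timestamp in ns, illustrative
rows = [(("node", "node%d" % (i % 64)), day, {"count": i})
        for i in range(3500)]
kept = write_points(rows)
print(len(kept))  # -> 64, one point per distinct remaining tag value
```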

Normal solutions to this problem would be:
1) Add back the domain tag - not feasible for me due to high cardinality 
over time.
2) Increment the timestamp - this is not supported by integrated CQs, and 
I don't see that it is easily supported by Kapacitor. It's mentioned here 
<https://groups.google.com/d/topic/influxdb/FFMmfTJ2pGg/discussion> and here 
<https://github.com/influxdata/influxdb/issues/4614>. I think there was a 
feature request related to this, but it was shot down; I can't find it 
again, though.

I see two possible solutions:
i) Use influx to ingest the data and do some summing, then put it into 
PostgreSQL for long-term retrieval, which should be fine once the data is 
sufficiently summarized. Maybe use www_fdw 
<https://github.com/cyga/www_fdw/wiki/Documentation> in PostgreSQL to query 
influx; otherwise, use an outside script which queries influx and inserts 
into PostgreSQL.
ii) Query the data from the top() command with a script, emulating the 
actions of the continuous query and re-inserting it, but modify the daily 
timestamp so that each item returned from top(100, domain) gets an 
incremental timestamp: ending in 1, 2, 3, and so on.
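
Option (ii) could be sketched like this: a script replays what the CQ 
would do but nudges each top-N row's timestamp by a few nanoseconds so 
the points no longer collide. The measurement and field names are made 
up for illustration:

```python
# Sketch: assign each top-N row a unique timestamp by adding its index
# (in nanoseconds) to the daily boundary, then re-insert the points.
def offset_top_n(rows, day_start_ns):
    """rows: (domain, count) pairs from a top(100, domain)-style query.
    Returns point dicts with unique, incrementing timestamps."""
    points = []
    for i, (domain, count) in enumerate(rows):
        points.append({
            "measurement": "domains_daily_top",   # hypothetical target
            "fields": {"domain": domain, "count": count},
            "time": day_start_ns + i,             # ends in 0, 1, 2, ...
        })
    return points

day = 1_451_606_400_000_000_000  # 2016-01-01T00:00:00Z in ns, for example
pts = offset_top_n([("a.com", 9), ("b.com", 7)], day)
print([p["time"] - day for p in pts])  # -> [0, 1]
```

Points in this shape could then be written back with an InfluxDB client 
library; the key point is only that every point gets a distinct timestamp 
within the day.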

I'm not sure if this is in any way a 'normal' use of influxdb, whether 
there's a feature request in here, or whether it would all be solved by 
the long-term plan to not require all tags to be in RAM, which would 
reduce (or eliminate) the whole cardinality constraint.
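
For what it's worth, the PostgreSQL side of option (i) could start with 
a flattening step like this. The response shape is InfluxDB's standard 
/query JSON ("results" > "series" > "columns"/"values"), but the 
measurement, column, and table names are my own placeholders:

```python
# Flatten an InfluxDB /query JSON response into rows suitable for a
# parameterized PostgreSQL INSERT via executemany().
def influx_series_to_rows(result):
    """Return (time, domain, count) tuples from each series in the
    first statement's result."""
    rows = []
    for series in result.get("results", [{}])[0].get("series", []):
        cols = series["columns"]
        for values in series["values"]:
            rec = dict(zip(cols, values))
            rows.append((rec["time"], rec.get("domain"), rec.get("count")))
    return rows

# With a driver such as psycopg2 (assumed), the insert would then be:
# cur.executemany(
#     "INSERT INTO domain_counts (ts, domain, count) VALUES (%s, %s, %s)",
#     influx_series_to_rows(resp.json()))

sample = {"results": [{"series": [{
    "name": "domains_1d",
    "columns": ["time", "domain", "count"],
    "values": [["2016-01-01T00:00:00Z", "example.com", 42]],
}]}]}
print(influx_series_to_rows(sample))
# -> [('2016-01-01T00:00:00Z', 'example.com', 42)]
```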
