[influxdb] Feedback on Approach to Storing Array Values in Influx

timjchin Wed, 28 Dec 2016 17:19:02 -0800

Hi all, 
We're working out our data model, and wondering about a solution we've come up 
with for handling array values.


When we receive a single record with array values, the thought is to split it 
into multiple records in Influx, and [modifying the lower bits of the 
timestamp](https://github.com/influxdata/influxdb/issues/2055#issuecomment-85126083)
 (nanoseconds) to preserve uniqueness and prevent updates on the original 
record. To ensure unique records, we're first taking a unique ID that's 
included on all of our records and hashing it into the lower 6 digits of the 
timestamp.  We're also hashing a three digit tag (0-999) to ensure that we can 
handle more than 999,999 records a second. 

If a record has array values that we want to track, we would add 1 millisecond 
to the timestamp for each additional record. Our data is at a one second time 
resolution, and we don't expect most of our records to have more than 10 array 
values, so a time shift of a few milliseconds is not concerning. 

For example, when we receive this record:
{
    "user": "a",
    "groups": ["a", "b", "c"],
    "id": "abc",
    "ip_address": "123.45.67.89",
    "tenant": "a",
    "action": "get",
    "timestamp": "2016-05-19T19:32:26Z",
}

It would be translated into three separate records within Influx:

keys,tenant=a,action=get,uid=001 user=a,group=a,id=abc,ip=123.45.67.89 
1463686346000983727
keys,tenant=a,action=get,uid=001 group=b 1463686346001983727
keys,tenant=a,action=get,uid=001 group=c 1463686346002983727

In this case, the lower 6 digits of the timestamp represent the unique hash 
(983727) of this record, and then the subsequent two records each get one 
additional millisecond to prevent an update on the original record.

We're most interested in COUNT(DISTINCT) queries, to let our users know how 
many unique values are seen over time. In this case, we could accurately query 
the number of groups.

Example query:
SELECT COUNT(DISTINCT("group") from "keys" WHERE time > '2016-12-09T00:00:00Z' 
GROUP BY tenant, action, time(1d) fill(none)

We're curious if others have tried a similar approach, and if this is a 
recommended way to use Influx? Are there pitfalls that we should be aware of 
when expanding our data in this manner?

--- 
Did some stress testing to see how the additional records would affect queries 
on fields that weren't included in the sparse records. After sending the 
records to influx, queries were done on the ip field (SELECT DISTINCT("ip") 
FROM "key" WHERE time > '2016-12-09T00:00:00Z' GROUP BY tenant,action,time(1d) 
fill(none))

Three tests were run, capturing a histogram of response times with 1 and 10 
concurrent queries.
Test 1a: 1,000,000 regular points (including all fields - user, group, id, ip)
Test 1b: 1,000,000 regular points, 500,000 only including additional group field
Test 1c: 1,500,000 regular points

Test 2a: 1,000,000 regular points 
Test 2b: 1,000,000 regular points, 5,000,000 with group field
Test 2c: 6,000,000 regular points

Test 3a: 10,000,000 regular points 
Test 3b: 10,000,000 regular points, 50,000,000 with group field
Test 3c: 60,000,000 regular points

Link to the results from these tests:
http://imgur.com/a/XCVhr

Influx handles the middle cases (1b, 2b, 3b) very well. The extra data has an 
10-20% impact in query times in the first and third tests, and actually 
performed better in most cases of the second test. 

---
Related threads:
https://groups.google.com/forum/#!msg/influxdb/NSAz6_DbK3I/aOHgPANyAwAJ;context-place=searchin/influxdb/array$20dataa%7Csort:relevance
https://groups.google.com/forum/#!msg/influxdb/N7yV2rtoUY0/oiqis-crDAAJ;context-place=msg/influxdb/NSAz6_DbK3I/aOHgPANyAwAJ


Appreciate any guidance that can be given!

-- 
Remember to include the version number!
--- 
You received this message because you are subscribed to the Google Groups 
"InfluxData" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/influxdb.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/influxdb/ec209436-f115-40d2-a924-28dbe0e3e4d9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[influxdb] Feedback on Approach to Storing Array Values in Influx

Reply via email to