Hi all, We're working out our data model, and wondering about a solution we've come up with for handling array values.
When we receive a single record with array values, the thought is to split it into multiple records in Influx, and [modifying the lower bits of the timestamp](https://github.com/influxdata/influxdb/issues/2055#issuecomment-85126083) (nanoseconds) to preserve uniqueness and prevent updates on the original record. To ensure unique records, we're first taking a unique ID that's included on all of our records and hashing it into the lower 6 digits of the timestamp. We're also hashing a three digit tag (0-999) to ensure that we can handle more than 999,999 records a second. If a record has array values that we want to track, we would add 1 millisecond to the timestamp for each additional record. Our data is at a one second time resolution, and we don't expect most of our records to have more than 10 array values, so a time shift of a few milliseconds is not concerning. For example, when we receive this record: { "user": "a", "groups": ["a", "b", "c"], "id": "abc", "ip_address": "123.45.67.89", "tenant": "a", "action": "get", "timestamp": "2016-05-19T19:32:26Z", } It would be translated into three separate records within Influx: keys,tenant=a,action=get,uid=001 user=a,group=a,id=abc,ip=123.45.67.89 1463686346000983727 keys,tenant=a,action=get,uid=001 group=b 1463686346001983727 keys,tenant=a,action=get,uid=001 group=c 1463686346002983727 In this case, the lower 6 digits of the timestamp represent the unique hash (983727) of this record, and then the subsequent two records each get one additional millisecond to prevent an update on the original record. We're most interested in COUNT(DISTINCT) queries, to let our users know how many unique values are seen over time. In this case, we could accurately query the number of groups. Example query: SELECT COUNT(DISTINCT("group") from "keys" WHERE time > '2016-12-09T00:00:00Z' GROUP BY tenant, action, time(1d) fill(none) We're curious if others have tried a similar approach, and if this is a recommended way to use Influx? Are there pitfalls that we should be aware of when expanding our data in this manner? --- Did some stress testing to see how the additional records would affect queries on fields that weren't included in the sparse records. After sending the records to influx, queries were done on the ip field (SELECT DISTINCT("ip") FROM "key" WHERE time > '2016-12-09T00:00:00Z' GROUP BY tenant,action,time(1d) fill(none)) Three tests were run, capturing a histogram of response times with 1 and 10 concurrent queries. Test 1a: 1,000,000 regular points (including all fields - user, group, id, ip) Test 1b: 1,000,000 regular points, 500,000 only including additional group field Test 1c: 1,500,000 regular points Test 2a: 1,000,000 regular points Test 2b: 1,000,000 regular points, 5,000,000 with group field Test 2c: 6,000,000 regular points Test 3a: 10,000,000 regular points Test 3b: 10,000,000 regular points, 50,000,000 with group field Test 3c: 60,000,000 regular points Link to the results from these tests: http://imgur.com/a/XCVhr Influx handles the middle cases (1b, 2b, 3b) very well. The extra data has an 10-20% impact in query times in the first and third tests, and actually performed better in most cases of the second test. --- Related threads: https://groups.google.com/forum/#!msg/influxdb/NSAz6_DbK3I/aOHgPANyAwAJ;context-place=searchin/influxdb/array$20dataa%7Csort:relevance https://groups.google.com/forum/#!msg/influxdb/N7yV2rtoUY0/oiqis-crDAAJ;context-place=msg/influxdb/NSAz6_DbK3I/aOHgPANyAwAJ Appreciate any guidance that can be given! -- Remember to include the version number! --- You received this message because you are subscribed to the Google Groups "InfluxData" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/influxdb. To view this discussion on the web visit https://groups.google.com/d/msgid/influxdb/ec209436-f115-40d2-a924-28dbe0e3e4d9%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
