Re: [influxdb] Schema design for arrays of points

Mitsutoshi Aoe Mon, 03 Oct 2016 21:32:49 -0700

Hi Sean,
Thank you for your reply.

Firstly, I miscalculated the number of points N. N is more like
90. So if I take the first route, the number of fields would be about 90+.

> It would be quick to return queries and the field set would be small.

I'm not sure why this is the case. If I always query all N points at a
given time to draw the line, don't options 1 and 2 have roughly the same
performance?

For example,

A) SELECT * FROM "line" ORDER BY "time" DESC LIMIT 1 # with the 1st schema
B) SELECT * FROM "point" GROUP BY "name" ORDER BY "time" DESC LIMIT 1 # with the 2nd schema

I thought A and B scan the same number of series. Am I right?
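
To make the comparison concrete, here is a small sketch that counts series
and field keys for the two schemas. It assumes the usual InfluxDB definition
of a series (a measurement plus a tag set) and two fields (x and y) per
point; the function name schema_counts is hypothetical. It only counts
what each schema stores and says nothing about actual query cost.

```python
def schema_counts(n_points):
    """Return (series, field keys per series, series x field keys) per schema."""
    # Schema 1: a single "line" series with no tags, fields p0.x, p0.y, ...
    schema1 = (1, 2 * n_points, 1 * 2 * n_points)
    # Schema 2: one "point" series per value of the "name" tag, fields x and y.
    schema2 = (n_points, 2, n_points * 2)
    return {"schema1": schema1, "schema2": schema2}


# N = 20 is the upper bound given in the original post.
counts = schema_counts(20)
print(counts)
```

Under these assumptions the series counts differ (1 versus N), while the
number of series/field combinations is the same (2N) in both schemas.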

> You can submit explicit timestamps at write time, rather than letting the
> system determine them. Alternately, if you leave the timestamps out, then
> every point in the batch will get the same timestamp.

True. I just feel a bit uneasy relying on the assumption that query B
always returns all the points that make up a line. Yes, we could use batch
writes to ensure that all points have the same timestamp and are written
at the same time. In the 1st schema, by contrast, it is guaranteed by
construction that the relevant points are bundled together in a response,
which is nice. But I guess this is not a big deal.
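
One way to avoid relying on that assumption is to check completeness on the
client. The sketch below is only an illustration: it assumes the rows from
the 2nd schema have already been flattened into (timestamp, name, x, y)
tuples, and the function name latest_complete_line is hypothetical.

```python
def latest_complete_line(rows, expected_names):
    """Return (timestamp, [(x, y), ...]) for the newest timestamp at which
    every expected point name is present, or None if there is no such one."""
    by_time = {}
    for ts, name, x, y in rows:
        by_time.setdefault(ts, {})[name] = (x, y)
    # Walk timestamps from newest to oldest and stop at the first full set.
    for ts in sorted(by_time, reverse=True):
        points = by_time[ts]
        if all(name in points for name in expected_names):
            return ts, [points[name] for name in expected_names]
    return None


rows = [
    (2, "p0", 0.5, 0.5),  # newest write, but p1 is missing at this timestamp
    (1, "p0", 0.0, 1.0),
    (1, "p1", 0.1, 0.2),
]
result = latest_complete_line(rows, ["p0", "p1"])
print(result)
```

Here the newest timestamp is skipped because it holds only one of the two
points, and the older, complete line is returned instead.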

> There isn't really a best practice for arrays in InfluxDB. I would start
> by modeling schemas 1 and 2 using the influx_stress tool to generate
> randomized load but with a defined schema

Thank you for the pointer! I'll give it a try.

Regards,
Mitsutoshi

On Tue, Oct 4, 2016 at 12:08, Sean Beckett <[email protected]> wrote:

On Sun, Oct 2, 2016 at 8:24 PM, Mitsutoshi Aoe <[email protected]> wrote:

Hi all,

I'm now trying to encode a set of time-varying 2D points into an InfluxDB
measurement.

Suppose we write N data points (p_0 .. p_N-1) on the xy-plane frequently
(every second or so). N isn't large (< 20) and may occasionally change over
time (e.g. every few months). The data points represent a line on the plane
over time. We continuously query those data points from InfluxDB to render
the line in real time or at given points in time. We usually need all the
points (p_0..p_N-1) at once and never query a subset of them.

What's the best schema for this use case? I can think of a few ideas:

1. Encode all the points as fields

line p0.x=0.0,p0.y=1.0,p1.x=0.1,p1.y=0.2,...

This has low series cardinality but high field cardinality. The RAM needs
of the system would be fairly low, and because each field is densely
populated it would compress and query fairly well. There can be performance
issues querying many fields at once, but since the field count is less than
40 and they are all floats, it might be okay depending on your query
frequency.
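
As a rough illustration of schema 1, the sketch below builds one line
protocol record from a list of (x, y) pairs. The function name
encode_line_as_fields is hypothetical, and the sketch does not handle
non-finite floats or attach a timestamp.

```python
def encode_line_as_fields(points, measurement="line"):
    """Encode all points of a line as fields of a single record."""
    fields = []
    for i, (x, y) in enumerate(points):
        # One pair of float fields per point: p<i>.x and p<i>.y.
        fields.append(f"p{i}.x={float(x)}")
        fields.append(f"p{i}.y={float(y)}")
    return f"{measurement} " + ",".join(fields)


record = encode_line_as_fields([(0.0, 1.0), (0.1, 0.2)])
print(record)
```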

2. Use a tag to distinguish points

point,name=p0 x=0.0,y=1.0
point,name=p1 x=0.1,y=0.2

This would potentially lead to high series cardinality, unless the point
names don't change over time. It would be quick to return queries and the
field set would be small. I don't think we have performance modeling for
the tradeoffs between tags and fields at 40+, but this is the schema I
would start with, other considerations aside.
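
A comparable sketch for schema 2 emits one record per point, with the point
name carried in a tag. In line protocol the tag set is attached to the
measurement with a comma. The function name encode_points_as_series is
hypothetical.

```python
def encode_points_as_series(points, measurement="point"):
    """Encode each point as its own record, distinguished by a "name" tag."""
    return [
        f"{measurement},name=p{i} x={float(x)},y={float(y)}"
        for i, (x, y) in enumerate(points)
    ]


records = encode_points_as_series([(0.0, 1.0), (0.1, 0.2)])
print("\n".join(records))
```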

3. Serialize all the points as a string

line value="[(0.0,1.0),(0.1,0.2)]"
It's not an efficient format; this is just to sketch the idea.

This would be storing long strings, which is not the best for
compressibility or RAM usage. There are also no string functions in
InfluxDB like substr or find, so you would always have to return the entire
line and work with that.
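
Since the whole string has to come back to the client, the parsing would
happen there. The sketch below assumes the bracketed tuple format shown in
the example above, which happens to be a valid Python literal; the function
name parse_line_value is hypothetical.

```python
import ast


def parse_line_value(value):
    """Parse a string like "[(0.0,1.0),(0.1,0.2)]" into a list of (x, y)."""
    # literal_eval only accepts Python literals, so it does not run code.
    points = ast.literal_eval(value)
    return [(float(x), float(y)) for x, y in points]


parsed = parse_line_value("[(0.0,1.0),(0.1,0.2)]")
print(parsed)
```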

1 looks good. I'm somewhat uncomfortable with using field names to
distinguish points, though. I feel better about 2 in this regard. But the
problem with 2 is that reconstructing the line from the points is
unnecessarily complicated:

2-A. Each point in the same line can have a different timestamp, whereas 1
guarantees that all points in the same line have the same timestamp.

You can submit explicit timestamps at write time, rather than letting the
system determine them. Alternately, if you leave the timestamps out, then
every point in the batch will get the same timestamp. As long as points on
lines are all in the same batch they will all have the same timestamp.
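
The explicit-timestamp approach can be sketched as follows: every record in
a batch gets the same trailing timestamp, in nanoseconds, which is the
default precision of the line protocol. The function name stamp_batch is
hypothetical, and the input records are assumed to carry no timestamp yet.

```python
import time


def stamp_batch(lines, timestamp_ns=None):
    """Append one shared nanosecond timestamp to every record in a batch."""
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()  # taken once, so all records agree
    return "\n".join(f"{line} {timestamp_ns}" for line in lines)


batch = stamp_batch(
    ["point,name=p0 x=0.0,y=1.0", "point,name=p1 x=0.1,y=0.2"],
    timestamp_ns=1475555569000000000,
)
print(batch)
```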

2-B. How many data points do we need to query to draw the current line?
There's no guarantee that fetching N data points covers all the data points
that are necessary to reconstruct the line.

This would require careful batching when writing, or using another tag to
differentiate the lines from each other.

3 looks terrible in terms of space efficiency. But it might be easiest to
reconstruct the line if you have a handy text parser.

It would be ideal if I could just store an array of numbers as a field
value in InfluxDB. But currently there seems to be no such feature. What's
the current best practice?

There isn't really a best practice for arrays in InfluxDB. I would start by
modeling schemas 1 and 2 using the influx_stress
<https://github.com/influxdata/influxdb/tree/master/stress/v2> tool to
generate randomized load but with a defined schema.

Thanks,
Mitsutoshi

-- 
Remember to include the InfluxDB version number with all issue reports
---
You received this message because you are subscribed to the Google Groups
"InfluxDB" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/influxdb.
To view this discussion on the web visit
https://groups.google.com/d/msgid/influxdb/f2f4bfec-fc87-44b4-a158-262dd657c560%40googlegroups.com
<https://groups.google.com/d/msgid/influxdb/f2f4bfec-fc87-44b4-a158-262dd657c560%40googlegroups.com?utm_medium=email&utm_source=footer>
.
For more options, visit https://groups.google.com/d/optout.

-- 
Sean Beckett
Director of Support and Professional Services
InfluxDB
