Re: [influxdb] Schema design for arrays of points

Sean Beckett Tue, 04 Oct 2016 11:23:22 -0700

With 180 fields, querying them all together might be RAM intensive. I would
definitely recommend using the stress tool to model that schema and the
query resource needs.


>> It would be quick to return queries and the field set would be small.

> I'm not sure why this is the case. If I always query all N points at a
given time to draw the line, don't the option 1 and 2 have roughly the same
performance?
> I thought A and B scan the same number of series. Am I right?

Yes and no. We're currently updating the docs to rename what we currently
call a series to the "metaseries". A metaseries is the combination of
measurement + tagset. A series is how the data is actually stored in the
TSM files, and that's a measurement + tagset + field.

Option 1 has ~180 series per metaseries. Option 2 has two series per
metaseries. The performance won't be identical, but I hesitate to guess how
they would differ.

> For example,

> A) SELECT * FROM "line" ORDER BY "time" DESC LIMIT 1 # with the 1st schema

A couple notes. First, ALWAYS use a "WHERE time" clause with a "SELECT *",
or else you can easily OOM the machine. The LIMIT clause doesn't yet
restrict the results queried, just the results returned. See
https://github.com/influxdata/influxdb/issues/7182 for more.)

> B) SELECT * FROM "point" GROUP BY "name" ORDER BY "time" DESC LIMIT 1 #
with the 2nd schema

This would return all the fields, but it might be clearer to use a "GROUP
BY *" clause to separate all the points into their own x,y buckets.


On Mon, Oct 3, 2016 at 10:34 PM, Mitsutoshi Aoe <[email protected]> wrote:

> Sorry, I had a miscalculation again!
>
> > Fistly, I had a miscalculation on the number of points N. N is more like
> 90. So if I take the first route, the number of fields would be about 90+.
>
> I meant to say the number of fields would be about 180+.
>
> Mitsutoshi
>
> 2016年10月4日(火) 13:32 Mitsutoshi Aoe <[email protected]>:
>
>> Hi Sean,
>> Thank you for your reply.
>>
>> Fistly, I had a miscalculation on the number of points N. N is more like
>> 90. So if I take the first route, the number of fields would be about 90+.
>>
>>
>> > It would be quick to return queries and the field set would be small.
>>
>> I'm not sure why this is the case. If I always query all N points at a
>> given time to draw the line, don't the option 1 and 2 have roughly the same
>> performance?
>>
>> For example,
>>
>> A) SELECT * FROM "line" ORDER BY "time" DESC LIMIT 1 # with the 1st schema
>> B) SELECT * FROM "point" GROUP BY "name" ORDER BY "time" DESC LIMIT 1 #
>> with the 2nd schema
>>
>> I thought A and B scan the same number of series. Am I right?
>>
>> > You can submit explicit timestamps at write time, rather than letting
>> the system determine them. Alternately, if you leave the timestamps out,
>> then every point in the batch will get the same timestamp.
>>
>> True. I just feel a bit uneasy to rely on the assumption that the query B
>> always returns all the points consist of a line. Yes, we could use batch
>> writing to ensure all points would have the same timestamp and would be
>> written at the same time. Whereas in the 1st schema, it is guaranteed that
>> relevant points are bundled up in a response by construction, which is
>> nice. But I guess this is not a big deal.
>>
>> > There isn't really a best practice for arrays in InfluxDB. I would
>> start by modeling schemas 1 and 2 using the influx_stress tool to generate
>> randomized load but with a defined schema
>>
>> Thank you for the pointer! I'll give it a try.
>>
>> Regards,
>> Mitsutoshi
>>
>> 2016年10月4日(火) 12:08 Sean Beckett <[email protected]>:
>>
>>> On Sun, Oct 2, 2016 at 8:24 PM, Mitsutoshi Aoe <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm now trying to encode a set of time-varying 2D points into
>>>> an InfluxDB measurement.
>>>>
>>>> Suppose we write N data points (p_0 .. p_N-1) on xy-plane frequently
>>>> (every second or so). N isn't large (< 20) and may occasionally change over
>>>> time (e.g. every few months). The data points represents a line on the
>>>> plane over time. We continuously query those data points from InfluxDB to
>>>> render the line realtime or at points in time. We usually need the whole
>>>> points (p_0..p_N-1) at once and never query a part of them.
>>>>
>>>> What the best schema for this use case? I can think of a few ideas:
>>>>
>>>> 1. Encode all the points as fields
>>>>
>>>> line p0.x=0.0,p0.y=1.0,p1.x=0.1,p1.y=0.2,...
>>>>
>>>>
>>> This has low series cardinality but high field cardinality. The RAM
>>> needs of the system would be fairly low, and because each field is densely
>>> populated it would compress and query fairly well. There can be performance
>>> issues querying many fields at once, but since the field count is less than
>>> 40 and they are all floats, it might be okay depending on your query
>>> frequency.
>>>
>>>
>>>> 2. Use a tag to distinguish points
>>>>
>>>> point name=p0 x=0.0,y=1.0
>>>> point name=p1 x=0.1,y=0.2
>>>>
>>>>
>>> This would potentially lead to high series cardinality, unless the point
>>> names don't change over time. It would be quick to return queries and the
>>> field set would be small. I don't think we have performance modeling for
>>> the tradeoffs between tags and fields at 40+, but this is the schema I
>>> would start with, other considerations aside.
>>>
>>>
>>>> 3. Serialize all the points as a string
>>>>
>>>> line value="[(0.0,1.0),(0.1,0.2)]"
>>>> It's not an efficient format but just to sketch the idea.
>>>>
>>>
>>> This would be storing long strings, which is not the best for
>>> compressibility or RAM usage. There are also no string functions in
>>> InfluxDB like substr or find, so you would always have to return the entire
>>> line and work with that.
>>>
>>>
>>>>
>>>> 1 looks good. I'm somehow uncomfortable with using fields names to
>>>> distinguish points though. I feel better with 2 in this regard. But the
>>>> problem with 2 is that reconstructing the line from the points are
>>>> unnecessarily complicated:
>>>>
>>>> 2-A. Each point in the same line can have different timestamps. Whereas
>>>> 1 guarantees that all points in the same line have the same timestamp.
>>>>
>>>
>>> You can submit explicit timestamps at write time, rather than letting
>>> the system determine them. Alternately, if you leave the timestamps out,
>>> then every point in the batch will get the same timestamp. As long as
>>> points on lines are all in the same batch they will all have the same
>>> timestamp.
>>>
>>>
>>>> 2-B. How much data points do we need to query to draw the current line?
>>>> There's no guarantee that fetching N data points covers all data points
>>>> that are necessary to reconstruct the line.
>>>>
>>>
>>> This would require careful batching when writing, or using another tag
>>> to differentiate the lines from each other.
>>>
>>>
>>>> 3 looks terrible in terms of space efficiency. But it might be easiest
>>>> to reconstruct the line if you have a handy text parser.
>>>>
>>>> It would be ideal if I could just store an array of numbers as a field
>>>> value in InfluxDB. But currently there seems to be no such feature. What's
>>>> the current best practice?
>>>>
>>>
>>> There isn't really a best practice for arrays in InfluxDB. I would start
>>> by modeling schemas 1 and 2 using the influx_stress
>>> <https://github.com/influxdata/influxdb/tree/master/stress/v2> tool to
>>> generate randomized load but with a defined schema.
>>>
>>>
>>>>
>>>>
>>>> Thanks,
>>>> Mitsutoshi
>>>>
>>>> --
>>>> Remember to include the InfluxDB version number with all issue reports
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "InfluxDB" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/influxdb.
>>>> To view this discussion on the web visit https://groups.google.com/d/
>>>> msgid/influxdb/f2f4bfec-fc87-44b4-a158-262dd657c560%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/influxdb/f2f4bfec-fc87-44b4-a158-262dd657c560%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>>
>>> --
>>> Sean Beckett
>>> Director of Support and Professional Services
>>> InfluxDB
>>>
>>> --
>>> Remember to include the InfluxDB version number with all issue reports
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "InfluxDB" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/influxdb.
>>> To view this discussion on the web visit https://groups.google.com/d/
>>> msgid/influxdb/CALGqCvP1%2BddhL%2B%3DGi8H7urCv_
>>> pMCnF37ih87%2BJ36FbTyi%3DN3rg%40mail.gmail.com
>>> <https://groups.google.com/d/msgid/influxdb/CALGqCvP1%2BddhL%2B%3DGi8H7urCv_pMCnF37ih87%2BJ36FbTyi%3DN3rg%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>> --
> Remember to include the InfluxDB version number with all issue reports
> ---
> You received this message because you are subscribed to the Google Groups
> "InfluxDB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/influxdb.
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/influxdb/CAMLnt0O9T551FMWY_xxan8d%2BLffto614i0SrOPpBfEYEZq%
> 2BH4g%40mail.gmail.com
> <https://groups.google.com/d/msgid/influxdb/CAMLnt0O9T551FMWY_xxan8d%2BLffto614i0SrOPpBfEYEZq%2BH4g%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>



-- 
Sean Beckett
Director of Support and Professional Services
InfluxDB

-- 
Remember to include the InfluxDB version number with all issue reports
--- 
You received this message because you are subscribed to the Google Groups 
"InfluxDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/influxdb.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/influxdb/CALGqCvNZ%2Bk%2BxmXOgiiK%3DMHrsuh5EMErjoxtbv6cdZeTNKCW-uQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: [influxdb] Schema design for arrays of points

Reply via email to