Re: [influxdb] Schema design for arrays of points

maoe . mobile Wed, 05 Oct 2016 03:21:40 -0700

> With 180 fields, querying them all together might be RAM intensive. I would 
> definitely recommend using the stress tool to model that schema and the query 
> resource needs.
[snip]
> Option 1 has ~180 series per metaseries. Option 2 has two series per 
> metaseries. The performance won't be identical, but I hesitate to guess how 
> they would differ.


I'll try influx_stress to check the performance.
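
Before running it, I tallied the series counts quoted above with a minimal Python sketch. It assumes the definition series = measurement + tagset + field and N = 90 points per line (my corrected estimate). The function and variable names are made up for illustration, and this is plain arithmetic, not a performance model.

```python
def count_series(metaseries: int, fields_per_metaseries: int) -> int:
    """Series = measurement + tagset + field, so multiply the two counts."""
    return metaseries * fields_per_metaseries


N = 90  # points per line (assumption, taken from the corrected estimate)

# Option 1: a single "line" metaseries with fields p0.x, p0.y, ... (2 * N fields).
total1 = count_series(metaseries=1, fields_per_metaseries=2 * N)

# Option 2: one "point" metaseries per value of the "name" tag, fields x and y.
total2 = count_series(metaseries=N, fields_per_metaseries=2)

print(total1, total2)
```

Both totals come out the same under these assumptions; only the grouping into metaseries differs, which is why I expected A and B to scan the same number of series.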

> A couple notes. First, ALWAYS use a "WHERE time" clause with a "SELECT *", or 
> else you can easily OOM the machine. The LIMIT clause doesn't yet restrict 
> the results queried, just the results returned. See 
> https://github.com/influxdata/influxdb/issues/7182 for more.

Actually, I've always been using a "WHERE time" clause. Sorry for the poor 
examples. But I didn't know about the performance characteristics of LIMIT 
queries.
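
To make that concrete, here is a minimal Python sketch of how I would build the time-bounded versions of queries A and B. The helper name `latest_query` and the 10s window are my own assumptions; the window would need to match the actual write interval.

```python
def latest_query(measurement: str, window: str = "10s", group_by: str = "") -> str:
    """Build an InfluxQL query for the newest row inside a recent time window."""
    # WHERE time bounds what is scanned; LIMIT only trims what is returned.
    parts = [
        f'SELECT * FROM "{measurement}"',
        f"WHERE time > now() - {window}",
    ]
    if group_by:
        parts.append(f"GROUP BY {group_by}")
    parts.append('ORDER BY "time" DESC LIMIT 1')
    return " ".join(parts)


query_a = latest_query("line")                  # 1st schema
query_b = latest_query("point", group_by="*")   # 2nd schema
print(query_a)
print(query_b)
```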

> This would return all the fields, but it might be clearer to use a "GROUP BY 
> *" clause to separate all the points into their own x,y buckets.

Good to know. Thanks.
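
For reference, this is roughly how I imagine reassembling the line on the client from a grouped result. It is a minimal Python sketch, assuming the JSON shape returned by the 1.x HTTP query endpoint (results, then series with tags, columns and values), integer epoch timestamps, and point names of the form p<index>. The function name and the sample response are made up.

```python
from typing import List, Tuple


def points_at_latest_time(response: dict) -> List[Tuple[str, float, float]]:
    """Return (name, x, y) for every point that carries the newest timestamp."""
    rows = []
    for result in response.get("results", []):
        for series in result.get("series", []):
            name = series.get("tags", {}).get("name", "")
            columns = series["columns"]
            for values in series["values"]:
                row = dict(zip(columns, values))
                rows.append((row["time"], name, row["x"], row["y"]))
    if not rows:
        return []
    latest = max(row[0] for row in rows)
    line = [(name, x, y) for t, name, x, y in rows if t == latest]
    # Sort by the numeric suffix so that p10 comes after p9.
    return sorted(line, key=lambda point: int(point[0][1:]))


# Hypothetical response for two points written in the same batch.
sample = {
    "results": [{
        "series": [
            {"name": "point", "tags": {"name": "p1"},
             "columns": ["time", "x", "y"], "values": [[1475662900, 0.1, 0.2]]},
            {"name": "point", "tags": {"name": "p0"},
             "columns": ["time", "x", "y"], "values": [[1475662900, 0.0, 1.0]]},
        ]
    }]
}
line_points = points_at_latest_time(sample)
print(line_points)
```

If fewer than N points come back for the newest timestamp, the line is incomplete, which is the case I was worried about with the 2nd schema.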

Regards,
Mitsutoshi

On Wednesday, October 5, 2016 at 3:23:05 UTC+9, Sean Beckett wrote:
> With 180 fields, querying them all together might be RAM intensive. I would 
> definitely recommend using the stress tool to model that schema and the query 
> resource needs.
> 
> 
> >> It would be quick to return queries and the field set would be small.
> 
> > I'm not sure why this is the case. If I always query all N points at a 
> > given time to draw the line, don't options 1 and 2 have roughly the same 
> > performance?
> > I thought A and B scan the same number of series. Am I right?
> 
> 
> 
> Yes and no. We're currently updating the docs to rename what we currently 
> call a series to the "metaseries". A metaseries is the combination of 
> measurement + tagset. A series is how the data is actually stored in the TSM 
> files, and that's a measurement + tagset + field.
> 
> 
> Option 1 has ~180 series per metaseries. Option 2 has two series per 
> metaseries. The performance won't be identical, but I hesitate to guess how 
> they would differ.
> 
> 
> 
> > For example,
> 
> 
> > A) SELECT * FROM "line" ORDER BY "time" DESC LIMIT 1 # with the 1st schema
> 
> 
> A couple notes. First, ALWAYS use a "WHERE time" clause with a "SELECT *", or 
> else you can easily OOM the machine. The LIMIT clause doesn't yet restrict 
> the results queried, just the results returned. See 
> https://github.com/influxdata/influxdb/issues/7182 for more.
> 
> 
> > B) SELECT * FROM "point" GROUP BY "name" ORDER BY "time" DESC LIMIT 1 # 
> > with the 2nd schema
> 
> 
> This would return all the fields, but it might be clearer to use a "GROUP BY 
> *" clause to separate all the points into their own x,y buckets.
> 
> 
> 
> 
> On Mon, Oct 3, 2016 at 10:34 PM, Mitsutoshi Aoe <[email protected]> wrote:
> 
> Sorry, I had a miscalculation again! 
> 
> 
> 
> > Firstly, I had a miscalculation on the number of points N. N is more like 
> > 90. So if I take the first route, the number of fields would be about 90+.
> 
> 
> I meant to say the number of fields would be about 180+.
> 
> 
> Mitsutoshi
> 
> 
> On Tue, Oct 4, 2016 at 13:32, Mitsutoshi Aoe <[email protected]> wrote:
> 
> 
> 
> Hi Sean,
> Thank you for your reply.
> 
> 
> Firstly, I had a miscalculation on the number of points N. N is more like 90. 
> So if I take the first route, the number of fields would be about 90+.
> 
> 
> 
> > It would be quick to return queries and the field set would be small.
> 
> 
> 
> I'm not sure why this is the case. If I always query all N points at a given 
> time to draw the line, don't options 1 and 2 have roughly the same 
> performance?
> 
> For example,
> 
> 
> A) SELECT * FROM "line" ORDER BY "time" DESC LIMIT 1 # with the 1st schema
> B) SELECT * FROM "point" GROUP BY "name" ORDER BY "time" DESC LIMIT 1 # with 
> the 2nd schema
> 
> 
> I thought A and B scan the same number of series. Am I right?
> 
> 
> 
> > You can submit explicit timestamps at write time, rather than letting the 
> > system determine them. Alternately, if you leave the timestamps out, then 
> > every point in the batch will get the same timestamp.
> 
> 
> 
> 
> True. I just feel a bit uneasy relying on the assumption that query B 
> always returns all the points that make up a line. Yes, we could use batch 
> writes to ensure all points have the same timestamp and are written at the 
> same time. In the 1st schema, by contrast, the relevant points are guaranteed 
> by construction to be bundled together in a response, which is nice. 
> But I guess this is not a big deal.
> 
> 
> 
> > There isn't really a best practice for arrays in InfluxDB. I would start by 
> >modeling schemas 1 and 2 using the influx_stress tool to generate randomized 
> >load but with a defined schema
> 
> 
> 
> Thank you for the pointer! I'll give it a try.
> 
> 
> Regards,
> Mitsutoshi
> 
> 
> 
> 
> 
> 
> On Tue, Oct 4, 2016 at 12:08, Sean Beckett <[email protected]> wrote:
> 
> On Sun, Oct 2, 2016 at 8:24 PM, Mitsutoshi Aoe <[email protected]> wrote:
> 
> Hi all,
> 
> 
> I'm now trying to encode a set of time-varying 2D points into an InfluxDB 
> measurement.
> 
> 
> Suppose we write N data points (p_0 .. p_N-1) on the xy-plane frequently (every 
> second or so). N isn't large (< 20) and may occasionally change over time 
> (e.g. every few months). The data points represent a line on the plane over 
> time. We continuously query those data points from InfluxDB to render the 
> line in real time or at given points in time. We usually need all the points 
> (p_0..p_N-1) at once and never query just a part of them.
> 
> 
> What's the best schema for this use case? I can think of a few ideas:
> 
> 
> 1. Encode all the points as fields
> 
> 
> 
> 
> line p0.x=0.0,p0.y=1.0,p1.x=0.1,p1.y=0.2,...
> 
> 
> 
> 
> 
> 
> This has low series cardinality but high field cardinality. The RAM needs of 
> the system would be fairly low, and because each field is densely populated 
> it would compress and query fairly well. There can be performance issues 
> querying many fields at once, but since the field count is less than 40 and 
> they are all floats, it might be okay depending on your query frequency.
> 
> 
> 
>  
> 
> 2. Use a tag to distinguish points
> 
> 
> 
> 
> point,name=p0 x=0.0,y=1.0
> point,name=p1 x=0.1,y=0.2
> 
> 
> 
> 
> 
> 
> This would potentially lead to high series cardinality, unless the point 
> names don't change over time. It would be quick to return queries and the 
> field set would be small. I don't think we have performance modeling for the 
> tradeoffs between tags and fields at 40+, but this is the schema I would 
> start with, other considerations aside. 
> 
> 
> 
>  
> 
> 3. Serialize all the points as a string
> 
> 
> 
> 
> line value="[(0.0,1.0),(0.1,0.2)]"
> It's not an efficient format but just to sketch the idea.
> 
> 
> 
> 
> 
> This would be storing long strings, which is not the best for compressibility 
> or RAM usage. There are also no string functions in InfluxDB like substr or 
> find, so you would always have to return the entire line and work with that.
> 
> 
> 
>  
> 
> 
> 
> 1 looks good. I'm somewhat uncomfortable with using field names to 
> distinguish points though. I feel better about 2 in this regard. But the 
> problem with 2 is that reconstructing the line from the points is 
> unnecessarily complicated:
> 
> 
> 2-A. Each point in the same line can have different timestamps. Whereas 1 
> guarantees that all points in the same line have the same timestamp.
> 
> 
> 
> 
> 
> You can submit explicit timestamps at write time, rather than letting the 
> system determine them. Alternately, if you leave the timestamps out, then 
> every point in the batch will get the same timestamp. As long as points on 
> lines are all in the same batch they will all have the same timestamp.
> 
> 
> 
>  
> 
> 2-B. How many data points do we need to query to draw the current line? 
> There's no guarantee that fetching N data points covers all data points that 
> are necessary to reconstruct the line.
> 
> 
> 
> 
> 
> This would require careful batching when writing, or using another tag to 
> differentiate the lines from each other.
> 
> 
> 
>  
> 
> 3 looks terrible in terms of space efficiency. But it might be easiest to 
> reconstruct the line if you have a handy text parser.
> 
> 
> 
> It would be ideal if I could just store an array of numbers as a field value 
> in InfluxDB. But currently there seems to be no such feature. What's the 
> current best practice?
> 
> 
> 
> 
> 
> There isn't really a best practice for arrays in InfluxDB. I would start by 
> modeling schemas 1 and 2 using the influx_stress tool to generate randomized 
> load but with a defined schema. 
> 
> 
> 
>  
> 
> 
> 
> 
> 
> Thanks,
> Mitsutoshi
> 
> 
> 
> 
> -- 
> 
> Remember to include the InfluxDB version number with all issue reports
> 
> --- 
> 
> You received this message because you are subscribed to the Google Groups 
> "InfluxDB" group.
> 
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> 
> To post to this group, send email to [email protected].
> 
> Visit this group at https://groups.google.com/group/influxdb.
> 
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/influxdb/f2f4bfec-fc87-44b4-a158-262dd657c560%40googlegroups.com.
> 
> For more options, visit https://groups.google.com/d/optout.
> 
> 
> 
> 
> 
> 
> 
> 
> -- 
> 
> 
> Sean Beckett
> Director of Support and Professional Services
> InfluxDB
> 
> 
> 
> 
> 
> -- 
> 
> 
> Sean Beckett
> Director of Support and Professional Services
> InfluxDB

