> With 180 fields, querying them all together might be RAM intensive. I would > definitely recommend using the stress tool to model that schema and the query > resource needs. [snip] > Option 1 has ~180 series per metaseries. Option 2 has two series per > metaseries. The performance won't be identical, but I hesitate to guess how > they would differ.
I'll try influx_stress to check the performance. > A couple notes. First, ALWAYS use a "WHERE time" clause with a "SELECT *", or > else you can easily OOM the machine. The LIMIT clause doesn't yet restrict > the results queried, just the results returned. See > https://github.com/influxdata/influxdb/issues/7182 for more.) Actually I've been always using "WHERE time" clause. Sorry for the poor examples. But I didn't know the performance characteristic of LIMIT queries. > This would return all the fields, but it might be clearer to use a "GROUP BY > *" clause to separate all the points into their own x,y buckets. Good to know. Thanks. Regards, Mitsutoshi 2016年10月5日水曜日 3時23分05秒 UTC+9 Sean Beckett: > With 180 fields, querying them all together might be RAM intensive. I would > definitely recommend using the stress tool to model that schema and the query > resource needs. > > > >> It would be quick to return queries and the field set would be small. > > > I'm not sure why this is the case. If I always query all N points at a > > given time to draw the line, don't the option 1 and 2 have roughly the same > > performance? > > I thought A and B scan the same number of series. Am I right? > > > > Yes and no. We're currently updating the docs to rename what we currently > call a series to the "metaseries". A metaseries is the combination of > measurement + tagset. A series is how the data is actually stored in the TSM > files, and that's a measurement + tagset + field. > > > Option 1 has ~180 series per metaseries. Option 2 has two series per > metaseries. The performance won't be identical, but I hesitate to guess how > they would differ. > > > > > For example, > > > > A) SELECT * FROM "line" ORDER BY "time" DESC LIMIT 1 # with the 1st schema > > > A couple notes. First, ALWAYS use a "WHERE time" clause with a "SELECT *", or > else you can easily OOM the machine. The LIMIT clause doesn't yet restrict > the results queried, just the results returned. See > https://github.com/influxdata/influxdb/issues/7182 for more.) > > > > B) SELECT * FROM "point" GROUP BY "name" ORDER BY "time" DESC LIMIT 1 # > > with the 2nd schema > > > This would return all the fields, but it might be clearer to use a "GROUP BY > *" clause to separate all the points into their own x,y buckets. > > > > > On Mon, Oct 3, 2016 at 10:34 PM, Mitsutoshi Aoe <[email protected]> wrote: > > Sorry, I had a miscalculation again! > > > > > Fistly, I had a miscalculation on the number of points N. N is more like > > 90. So if I take the first route, the number of fields would be about 90+. > > > I meant to say the number of fields would be about 180+. > > > Mitsutoshi > > > 2016年10月4日(火) 13:32 Mitsutoshi Aoe <[email protected]>: > > > > Hi Sean, > Thank you for your reply. > > > Fistly, I had a miscalculation on the number of points N. N is more like 90. > So if I take the first route, the number of fields would be about 90+. > > > > > It would be quick to return queries and the field set would be small. > > > > I'm not sure why this is the case. If I always query all N points at a given > time to draw the line, don't the option 1 and 2 have roughly the same > performance? > > For example, > > > A) SELECT * FROM "line" ORDER BY "time" DESC LIMIT 1 # with the 1st schema > B) SELECT * FROM "point" GROUP BY "name" ORDER BY "time" DESC LIMIT 1 # with > the 2nd schema > > > I thought A and B scan the same number of series. Am I right? > > > > > You can submit explicit timestamps at write time, rather than letting the > > system determine them. Alternately, if you leave the timestamps out, then > > every point in the batch will get the same timestamp. > > > > > True. I just feel a bit uneasy to rely on the assumption that the query B > always returns all the points consist of a line. Yes, we could use batch > writing to ensure all points would have the same timestamp and would be > written at the same time. Whereas in the 1st schema, it is guaranteed that > relevant points are bundled up in a response by construction, which is nice. > But I guess this is not a big deal. > > > > > There isn't really a best practice for arrays in InfluxDB. I would start by > >modeling schemas 1 and 2 using the influx_stress tool to generate randomized > >load but with a defined schema > > > > Thank you for the pointer! I'll give it a try. > > > Regards, > Mitsutoshi > > > > > > > 2016年10月4日(火) 12:08 Sean Beckett <[email protected]>: > > > > > > > > > > On Sun, Oct 2, 2016 at 8:24 PM, Mitsutoshi Aoe <[email protected]> wrote: > > Hi all, > > > I'm now trying to encode a set of time-varying 2D points into an InfluxDB > measurement. > > > Suppose we write N data points (p_0 .. p_N-1) on xy-plane frequently (every > second or so). N isn't large (< 20) and may occasionally change over time > (e.g. every few months). The data points represents a line on the plane over > time. We continuously query those data points from InfluxDB to render the > line realtime or at points in time. We usually need the whole points > (p_0..p_N-1) at once and never query a part of them. > > > What the best schema for this use case? I can think of a few ideas: > > > 1. Encode all the points as fields > > > > > line p0.x=0.0,p0.y=1.0,p1.x=0.1,p1.y=0.2,... > > > > > > > This has low series cardinality but high field cardinality. The RAM needs of > the system would be fairly low, and because each field is densely populated > it would compress and query fairly well. There can be performance issues > querying many fields at once, but since the field count is less than 40 and > they are all floats, it might be okay depending on your query frequency. > > > > > > 2. Use a tag to distinguish points > > > > > point name=p0 x=0.0,y=1.0 > point name=p1 x=0.1,y=0.2 > > > > > > > This would potentially lead to high series cardinality, unless the point > names don't change over time. It would be quick to return queries and the > field set would be small. I don't think we have performance modeling for the > tradeoffs between tags and fields at 40+, but this is the schema I would > start with, other considerations aside. > > > > > > 3. Serialize all the points as a string > > > > > line value="[(0.0,1.0),(0.1,0.2)]" > It's not an efficient format but just to sketch the idea. > > > > > > This would be storing long strings, which is not the best for compressibility > or RAM usage. There are also no string functions in InfluxDB like substr or > find, so you would always have to return the entire line and work with that. > > > > > > > > 1 looks good. I'm somehow uncomfortable with using fields names to > distinguish points though. I feel better with 2 in this regard. But the > problem with 2 is that reconstructing the line from the points are > unnecessarily complicated: > > > 2-A. Each point in the same line can have different timestamps. Whereas 1 > guarantees that all points in the same line have the same timestamp. > > > > > > You can submit explicit timestamps at write time, rather than letting the > system determine them. Alternately, if you leave the timestamps out, then > every point in the batch will get the same timestamp. As long as points on > lines are all in the same batch they will all have the same timestamp. > > > > > > 2-B. How much data points do we need to query to draw the current line? > There's no guarantee that fetching N data points covers all data points that > are necessary to reconstruct the line. > > > > > > This would require careful batching when writing, or using another tag to > differentiate the lines from each other. > > > > > > 3 looks terrible in terms of space efficiency. But it might be easiest to > reconstruct the line if you have a handy text parser. > > > > It would be ideal if I could just store an array of numbers as a field value > in InfluxDB. But currently there seems to be no such feature. What's the > current best practice? > > > > > > There isn't really a best practice for arrays in InfluxDB. I would start by > modeling schemas 1 and 2 using the influx_stress tool to generate randomized > load but with a defined schema. > > > > > > > > > > Thanks, > Mitsutoshi > > > > > -- > > Remember to include the InfluxDB version number with all issue reports > > --- > > You received this message because you are subscribed to the Google Groups > "InfluxDB" group. > > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > > To post to this group, send email to [email protected]. > > Visit this group at https://groups.google.com/group/influxdb. > > To view this discussion on the web visit > https://groups.google.com/d/msgid/influxdb/f2f4bfec-fc87-44b4-a158-262dd657c560%40googlegroups.com. > > For more options, visit https://groups.google.com/d/optout. > > > > > > > > > -- > > > Sean Beckett > Director of Support and Professional Services > InfluxDB > > > > > > -- > > Remember to include the InfluxDB version number with all issue reports > > --- > > You received this message because you are subscribed to the Google Groups > "InfluxDB" group. > > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > > To post to this group, send email to [email protected]. > > Visit this group at https://groups.google.com/group/influxdb. > > To view this discussion on the web visit > https://groups.google.com/d/msgid/influxdb/CALGqCvP1%2BddhL%2B%3DGi8H7urCv_pMCnF37ih87%2BJ36FbTyi%3DN3rg%40mail.gmail.com. > > For more options, visit https://groups.google.com/d/optout. > > > > > > > > > -- > > Remember to include the InfluxDB version number with all issue reports > > --- > > You received this message because you are subscribed to the Google Groups > "InfluxDB" group. > > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > > To post to this group, send email to [email protected]. > > Visit this group at https://groups.google.com/group/influxdb. > > To view this discussion on the web visit > https://groups.google.com/d/msgid/influxdb/CAMLnt0O9T551FMWY_xxan8d%2BLffto614i0SrOPpBfEYEZq%2BH4g%40mail.gmail.com. > > > > For more options, visit https://groups.google.com/d/optout. > > > > > > -- > > > Sean Beckett > Director of Support and Professional Services > InfluxDB -- Remember to include the InfluxDB version number with all issue reports --- You received this message because you are subscribed to the Google Groups "InfluxDB" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/influxdb. To view this discussion on the web visit https://groups.google.com/d/msgid/influxdb/d0c69487-6172-47b0-b3a1-7463cc7ab7f3%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
