One thing I really don't understand is why you have the notion of measurement in the first place. Since fields from different measurements cannot be joined, it would be safer to only have a single measurement, just in case any join should be done in the future.
On Tuesday, June 21, 2016 at 6:20:59 PM UTC+2, Sean Beckett wrote: > > > > On Mon, Jun 20, 2016 at 5:56 PM, Zoltan Szalai <[email protected] > <javascript:>> wrote: > >> On 2016.06.21. 1:22, Sean Beckett wrote: >> >> >> >> On Mon, Jun 20, 2016 at 7:25 AM, Zoltan Szalai <[email protected] >> <javascript:>> wrote: >> >>> Hello, >>> >>> I would like to ask schema design advise for the following use case: >>> >>> I want to collect time series data of trips of cars. >>> I collect about 20-30 different metrics per trip (like speed, rpm, >>> location, temperature etc.) with different frequencies ranging from 3 >>> seconds to 1 minute. >>> A car generates about 150-300 trips in a month, 2000-3000 trip a year. >>> >>> I have some data about the trips stored in an rdbms like trip_id, car, >>> driver, start datetime, end datetime and many more and I want to store only >>> the raw, collected time series data in influxdb. Among the raw data >>> collected from the cars I have to calculate additional time series data >>> derived from the collected data, like fuel consumption and also single >>> values like distance, average speed etc. typically when a trip ends. >>> >>> I had/have the following ideas in my mind: >>> 1. My first naive approach was to just simple store every trip's data in >>> its own measurement (trip_<trip_id>) without any tags but that would >>> generate a huge amount of series sooner or later. Since then I've learned >>> that series cardinality matters the most in terms of memory usage and >>> performance so it's obviously not a good choice. So any option that would >>> use the trip ids as tag values sounds like a really bad idea because I'll >>> have a lot of trips. >>> >> >> Encoding metadata in the measurement name is a bad practice. You should >> have dozens, maybe hundreds of measurements. More than that is a schema >> smell. >> >> >> >> Actually in this case, eventually I'd have as many series / measurements >> as trips which will be millions. So yeah, not a good idea. >> > > A series is in no way equivalent to a measurement. Each measurement is a > logical container of perhaps thousands or hundreds of thousands of series. > Each series in a measurement is joined by default for query results. Series > in two different measurements cannot be joined at all. > > You wouldn't have a measurement for each trip. I think you should have > only one measurement for all trips. You would have a series for each trip, > however. > > Total series count is what matters most for RAM usage. > > > > >> >> >> >> 2. Store all the data in a single measurement and use for example the car >>> id as a tag. That would generate giant amount of points in a single >>> measurement (is it a problem?) and would make horizontal partitioning of >>> the data difficult if I'd need to use multiple hosts / databases later. >>> >> >> 2-3k trips per year per car. Are you monitoring more than a few hundred >> cars? InfluxDB is fine up to about a million series, and then things get >> interesting. >> >> >> Not yet, but the number of cars will grow to thousands or perphaps tens >> of thousands (or more) in time. At least, that's the plan. :) >> >> >> Why not shard the cars into separate instances? That way the series >> cardinality can be divided up amongst multiple InfluxDB processes. If you >> need to query across cars this isn't as useful, but if each car is a >> separate consumer it seems reasonable. >> >> >> Basically, that is what I'm suggesting in 3. >> > > No, the series cardinality concerns are per instance, not per database or > per measurement. The entire InfluxDB server shares series cardinality. The > only way to distribute the cardinality load is to put the series on > different InfluxDB servers, whether different nodes in a cluster or > different instances entirely. > > >> >> >> >>> 3. Measurement / per car, no tags. This seems like the best approach for >>> me at the moment. Horizontal partitioning would be relatively trivial using >>> car id as shard key if the number of cars on the system increases. Series >>> cardinality would equal with car cardinality. Obviously in order to get >>> only the data of a single trip (which is important) I'd have to rely on the >>> meta data (the trip's start and end time) stored in the rdbms. That is true >>> for 2. as well. >>> >> >> Too many measurements, and with no tags you're losing 50%+ of the >> functionality of the query language. >> >> >> I'd have as many measurements (and thereby series) as cars in the system. >> And not necessarily in one database. >> I thought only the series cardinality counts, not the measurement >> cardinality. >> > > Measurements are containers of series, they do not exist independent of > the series they contain. Think of them as mini-databases within the > database. Every measurement has at least one series. > > >> What impacts series cardinality the most is the number of different tag >> values if I understand correctly. >> > > The number of actual combinations of tag values. See > https://groups.google.com/d/msgid/influxdb/23486fd5-4f40-4ae7-9573-a09eb55e6683%40googlegroups.com > > for more context. > > >> I am not sure how I could utilize tags at his point but maybe I'm missing >> something. >> >> >> I'm open to new ideas as well. What do you think? >>> >> >> Start with your queries, work backwards from there. What questions does >> the schema need to support? >> >> >> >> >> The most important one is: give me all the data that belongs to a single >> trip. (so I can make some calculations from them and generate other series). >> for 1. this would be select * from trip_<trip_id> >> for 2. select * from the_one_giant_measurement where car_id = >> <trips_car_id> and time > <trip_start> and time < <trip_end> >> for 3. select * from car_<car_id> where time > <trip_start> and time < >> <trip_end> >> > > I think you want a schema that uses trip_id as a tag, and then you shard > the trips across multiple instances when cardinality becomes a concern. > > SELECT * FROM trips WHERE trip_id = '1234' > > > >> >> >> Thanks for your time, Sean! >> >> >>> Thanks >>> Zoltan >>> >>> -- >>> Remember to include the InfluxDB version number with all issue reports >>> --- You received this message because you are subscribed to the Google >>> Groups "InfluxDB" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to <javascript:>[email protected] <javascript:>. >>> To post to this group, send email to <javascript:> >>> [email protected] <javascript:>. >>> Visit this group at https://groups.google.com/group/influxdb. >>> To view this discussion on the web visit >>> <https://groups.google.com/d/msgid/influxdb/f5bb0779-f2ac-d9f3-7078-9709d6369714%40gmail.com> >>> https://groups.google.com/d/msgid/influxdb/f5bb0779-f2ac-d9f3-7078-9709d6369714%40gmail.com >>> . >>> For more options, visit https://groups.google.com/d/optout. >>> >> >> >> >> -- >> Sean Beckett >> Director of Support and Professional Services >> InfluxDB >> -- >> Remember to include the InfluxDB version number with all issue reports >> --- >> You received this message because you are subscribed to the Google Groups >> "InfluxDB" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to [email protected] <javascript:>. >> To post to this group, send email to [email protected] >> <javascript:>. >> Visit this group at https://groups.google.com/group/influxdb. >> To view this discussion on the web visit >> <https://groups.google.com/d/msgid/influxdb/CALGqCvMrxAEUzCVPwuG12aaJkv6Y%3DRKL1pkz%3DMqrLO1NVszwDA%40mail.gmail.com?utm_medium=email&utm_source=footer> >> https://groups.google.com/d/msgid/influxdb/CALGqCvMrxAEUzCVPwuG12aaJkv6Y%3DRKL1pkz%3DMqrLO1NVszwDA%40mail.gmail.com >> . >> For more options, visit https://groups.google.com/d/optout. >> >> >> -- >> Remember to include the InfluxDB version number with all issue reports >> --- >> You received this message because you are subscribed to the Google Groups >> "InfluxDB" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to [email protected] <javascript:>. >> To post to this group, send email to [email protected] >> <javascript:>. >> Visit this group at https://groups.google.com/group/influxdb. >> To view this discussion on the web visit >> https://groups.google.com/d/msgid/influxdb/d050ba12-d7b9-6f27-b084-fc2d979b9a78%40gmail.com >> >> <https://groups.google.com/d/msgid/influxdb/d050ba12-d7b9-6f27-b084-fc2d979b9a78%40gmail.com?utm_medium=email&utm_source=footer> >> . >> >> For more options, visit https://groups.google.com/d/optout. >> > > > > -- > Sean Beckett > Director of Support and Professional Services > InfluxDB > -- Remember to include the InfluxDB version number with all issue reports --- You received this message because you are subscribed to the Google Groups "InfluxDB" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/influxdb. To view this discussion on the web visit https://groups.google.com/d/msgid/influxdb/02e7516f-e3a4-44b1-a928-d974ce66e2b8%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
