Re: [influxdb] schema design advice

Mathias Herberts Tue, 21 Jun 2016 12:43:38 -0700

One thing I really don't understand is why you have the notion of 
measurement in the first place. Since fields from different measurements 
cannot be joined, it would be safer to only have a single measurement, just 
in case any join should be done in the future.



On Tuesday, June 21, 2016 at 6:20:59 PM UTC+2, Sean Beckett wrote:
>
>
>
> On Mon, Jun 20, 2016 at 5:56 PM, Zoltan Szalai <[email protected] 
> <javascript:>> wrote:
>
>> On 2016.06.21. 1:22, Sean Beckett wrote:
>>
>>
>>
>> On Mon, Jun 20, 2016 at 7:25 AM, Zoltan Szalai <[email protected] 
>> <javascript:>> wrote:
>>
>>> Hello,
>>>
>>> I would like to ask schema design advise for the following use case:
>>>
>>> I want to collect time series data of trips of cars.
>>> I collect about 20-30 different metrics per trip (like speed, rpm, 
>>> location, temperature etc.) with different frequencies ranging from 3 
>>> seconds to 1 minute.
>>> A car generates about 150-300 trips in a month, 2000-3000 trip a year.
>>>
>>> I have some data about the trips stored in an rdbms like trip_id, car, 
>>> driver, start datetime, end datetime and many more and I want to store only 
>>> the raw, collected time series data in influxdb. Among the raw data 
>>> collected from the cars I have to calculate additional time series data 
>>> derived from the collected data, like fuel consumption and also single 
>>> values like distance, average speed etc. typically when a trip ends.
>>>
>>> I had/have the following ideas in my mind:
>>> 1. My first naive approach was to just simple store every trip's data in 
>>> its own measurement (trip_<trip_id>) without any tags but that would 
>>> generate a huge amount of series sooner or later. Since then I've learned 
>>> that series cardinality matters the most in terms of memory usage and 
>>> performance so it's obviously not a good choice. So any option that would 
>>> use the trip ids as tag values sounds like a really bad idea because I'll 
>>> have a lot of trips.
>>>
>>
>> Encoding metadata in the measurement name is a bad practice. You should 
>> have dozens, maybe hundreds of measurements. More than that is a schema 
>> smell.
>>  
>>
>>
>> Actually in this case, eventually I'd have as many series / measurements 
>> as trips which will be millions. So yeah, not a good idea.
>>
>
> A series is in no way equivalent to a measurement. Each measurement is a 
> logical container of perhaps thousands or hundreds of thousands of series. 
> Each series in a measurement is joined by default for query results. Series 
> in two different measurements cannot be joined at all.
>
> You wouldn't have a measurement for each trip. I think you should have 
> only one measurement for all trips. You would have a series for each trip, 
> however.
>
> Total series count is what matters most for RAM usage. 
>
>
>  
>
>>
>>
>>
>> 2. Store all the data in a single measurement and use for example the car 
>>> id as a tag. That would generate giant amount of points in a single 
>>> measurement (is it a problem?) and would make horizontal partitioning of 
>>> the data difficult if I'd need to use multiple hosts / databases later.
>>>
>>
>> 2-3k trips per year per car. Are you monitoring more than a few hundred 
>> cars? InfluxDB is fine up to about a million series, and then things get 
>> interesting. 
>>
>>
>> Not yet, but the number of cars will grow to thousands or perphaps tens 
>> of thousands (or more) in time. At least, that's the plan. :)
>>
>>
>> Why not shard the cars into separate instances? That way the series 
>> cardinality can be divided up amongst multiple InfluxDB processes. If you 
>> need to query across cars this isn't as useful, but if each car is a 
>> separate consumer it seems reasonable.
>>
>>
>> Basically, that is what I'm suggesting in 3. 
>>
>
> No, the series cardinality concerns are per instance, not per database or 
> per measurement. The entire InfluxDB server shares series cardinality. The 
> only way to distribute the cardinality load is to put the series on 
> different InfluxDB servers, whether different nodes in a cluster or 
> different instances entirely.
>  
>
>>
>>  
>>
>>> 3. Measurement / per car, no tags. This seems like the best approach for 
>>> me at the moment. Horizontal partitioning would be relatively trivial using 
>>> car id as shard key if the number of cars on the system increases. Series 
>>> cardinality would equal with car cardinality. Obviously in order to get 
>>> only the data of a single trip (which is important) I'd have to rely on the 
>>> meta data (the trip's start and end time) stored in the rdbms. That is true 
>>> for 2. as well.
>>>
>>
>> Too many measurements, and with no tags you're losing 50%+ of the 
>> functionality of the query language.
>>
>>
>> I'd have as many measurements (and thereby series) as cars in the system. 
>> And not necessarily in one database.
>> I thought only the series cardinality counts, not the measurement 
>> cardinality. 
>>
>
> Measurements are containers of series, they do not exist independent of 
> the series they contain. Think of them as mini-databases within the 
> database. Every measurement has at least one series.
>  
>
>> What impacts series cardinality the most is the number of different tag 
>> values if I understand correctly.
>>
>
> The number of actual combinations of tag values. See 
> https://groups.google.com/d/msgid/influxdb/23486fd5-4f40-4ae7-9573-a09eb55e6683%40googlegroups.com
>  
> for more context.
>  
>
>> I am not sure how I could utilize tags at his point but maybe I'm missing 
>> something.
>>
>>
>> I'm open to new ideas as well. What do you think?
>>>
>>
>> Start with your queries, work backwards from there. What questions does 
>> the schema need to support?
>>  
>>
>>
>>
>> The most important one is: give me all the data that belongs to a single 
>> trip. (so I can make some calculations from them and generate other series).
>> for 1. this would be select * from trip_<trip_id>
>> for 2. select * from the_one_giant_measurement where car_id = 
>> <trips_car_id> and time > <trip_start> and time < <trip_end>
>> for 3. select * from car_<car_id> where time > <trip_start> and time < 
>> <trip_end>
>>
>
> I think you want a schema that uses trip_id as a tag, and then you shard 
> the trips across multiple instances when cardinality becomes a concern. 
>
> SELECT * FROM trips WHERE trip_id = '1234'
>
>  
>
>>
>>
>> Thanks for your time, Sean!
>>
>>
>>> Thanks
>>> Zoltan
>>>
>>> -- 
>>> Remember to include the InfluxDB version number with all issue reports
>>> --- You received this message because you are subscribed to the Google 
>>> Groups "InfluxDB" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to <javascript:>[email protected] <javascript:>.
>>> To post to this group, send email to <javascript:>
>>> [email protected] <javascript:>.
>>> Visit this group at https://groups.google.com/group/influxdb.
>>> To view this discussion on the web visit 
>>> <https://groups.google.com/d/msgid/influxdb/f5bb0779-f2ac-d9f3-7078-9709d6369714%40gmail.com>
>>> https://groups.google.com/d/msgid/influxdb/f5bb0779-f2ac-d9f3-7078-9709d6369714%40gmail.com
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>>
>> -- 
>> Sean Beckett 
>> Director of Support and Professional Services
>> InfluxDB
>> -- 
>> Remember to include the InfluxDB version number with all issue reports
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "InfluxDB" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> To post to this group, send email to [email protected] 
>> <javascript:>.
>> Visit this group at https://groups.google.com/group/influxdb.
>> To view this discussion on the web visit 
>> <https://groups.google.com/d/msgid/influxdb/CALGqCvMrxAEUzCVPwuG12aaJkv6Y%3DRKL1pkz%3DMqrLO1NVszwDA%40mail.gmail.com?utm_medium=email&utm_source=footer>
>> https://groups.google.com/d/msgid/influxdb/CALGqCvMrxAEUzCVPwuG12aaJkv6Y%3DRKL1pkz%3DMqrLO1NVszwDA%40mail.gmail.com
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>>
>> -- 
>> Remember to include the InfluxDB version number with all issue reports
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "InfluxDB" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> To post to this group, send email to [email protected] 
>> <javascript:>.
>> Visit this group at https://groups.google.com/group/influxdb.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/influxdb/d050ba12-d7b9-6f27-b084-fc2d979b9a78%40gmail.com
>>  
>> <https://groups.google.com/d/msgid/influxdb/d050ba12-d7b9-6f27-b084-fc2d979b9a78%40gmail.com?utm_medium=email&utm_source=footer>
>> .
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> -- 
> Sean Beckett
> Director of Support and Professional Services
> InfluxDB
>

-- 
Remember to include the InfluxDB version number with all issue reports
--- 
You received this message because you are subscribed to the Google Groups 
"InfluxDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/influxdb.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/influxdb/02e7516f-e3a4-44b1-a928-d974ce66e2b8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [influxdb] schema design advice

Reply via email to