Re: [influxdb] schema design advice

Mathias Herberts Tue, 21 Jun 2016 13:57:55 -0700

You could want to correlate cq stats with other write stats, so somehow 
joining them may make perfect sense.


You should never consider that two measurements will never be joined, this 
is othogonal to what data science really is, i.e. exploration in many 
different ways to discover patterns or trends.

On Tuesday, June 21, 2016 at 9:47:37 PM UTC+2, Sean Beckett wrote:
>
> Measurements are a way to segregate related field sets. You can put 
> everything into one measurement, but it makes things more confusing. You 
> can also put everything in one database, but why do that if they are 
> unrelated.
>
> Look at the _internal database for an example of why separate measurements 
> makes sense. Why store "write" stats in the same place as "cq" stats, or 
> alongside "runtime" information? They aren't related, you'd never JOIN 
> them, so might as well keep them logically separate for clarity. 
>
> On Tue, Jun 21, 2016 at 1:43 PM, Mathias Herberts <[email protected] 
> <javascript:>> wrote:
>
>> One thing I really don't understand is why you have the notion of 
>> measurement in the first place. Since fields from different measurements 
>> cannot be joined, it would be safer to only have a single measurement, just 
>> in case any join should be done in the future.
>>
>>
>> On Tuesday, June 21, 2016 at 6:20:59 PM UTC+2, Sean Beckett wrote:
>>>
>>>
>>>
>>> On Mon, Jun 20, 2016 at 5:56 PM, Zoltan Szalai <[email protected]> 
>>> wrote:
>>>
>>>> On 2016.06.21. 1:22, Sean Beckett wrote:
>>>>
>>>>
>>>>
>>>> On Mon, Jun 20, 2016 at 7:25 AM, Zoltan Szalai <[email protected]> 
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I would like to ask schema design advise for the following use case:
>>>>>
>>>>> I want to collect time series data of trips of cars.
>>>>> I collect about 20-30 different metrics per trip (like speed, rpm, 
>>>>> location, temperature etc.) with different frequencies ranging from 3 
>>>>> seconds to 1 minute.
>>>>> A car generates about 150-300 trips in a month, 2000-3000 trip a year.
>>>>>
>>>>> I have some data about the trips stored in an rdbms like trip_id, car, 
>>>>> driver, start datetime, end datetime and many more and I want to store 
>>>>> only 
>>>>> the raw, collected time series data in influxdb. Among the raw data 
>>>>> collected from the cars I have to calculate additional time series data 
>>>>> derived from the collected data, like fuel consumption and also single 
>>>>> values like distance, average speed etc. typically when a trip ends.
>>>>>
>>>>> I had/have the following ideas in my mind:
>>>>> 1. My first naive approach was to just simple store every trip's data 
>>>>> in its own measurement (trip_<trip_id>) without any tags but that would 
>>>>> generate a huge amount of series sooner or later. Since then I've learned 
>>>>> that series cardinality matters the most in terms of memory usage and 
>>>>> performance so it's obviously not a good choice. So any option that would 
>>>>> use the trip ids as tag values sounds like a really bad idea because I'll 
>>>>> have a lot of trips.
>>>>>
>>>>
>>>> Encoding metadata in the measurement name is a bad practice. You should 
>>>> have dozens, maybe hundreds of measurements. More than that is a schema 
>>>> smell.
>>>>  
>>>>
>>>>
>>>> Actually in this case, eventually I'd have as many series / 
>>>> measurements as trips which will be millions. So yeah, not a good idea.
>>>>
>>>
>>> A series is in no way equivalent to a measurement. Each measurement is a 
>>> logical container of perhaps thousands or hundreds of thousands of series. 
>>> Each series in a measurement is joined by default for query results. Series 
>>> in two different measurements cannot be joined at all.
>>>
>>> You wouldn't have a measurement for each trip. I think you should have 
>>> only one measurement for all trips. You would have a series for each trip, 
>>> however.
>>>
>>> Total series count is what matters most for RAM usage. 
>>>
>>>
>>>  
>>>
>>>>
>>>>
>>>>
>>>> 2. Store all the data in a single measurement and use for example the 
>>>>> car id as a tag. That would generate giant amount of points in a single 
>>>>> measurement (is it a problem?) and would make horizontal partitioning of 
>>>>> the data difficult if I'd need to use multiple hosts / databases later.
>>>>>
>>>>
>>>> 2-3k trips per year per car. Are you monitoring more than a few hundred 
>>>> cars? InfluxDB is fine up to about a million series, and then things get 
>>>> interesting. 
>>>>
>>>>
>>>> Not yet, but the number of cars will grow to thousands or perphaps tens 
>>>> of thousands (or more) in time. At least, that's the plan. :)
>>>>
>>>>
>>>> Why not shard the cars into separate instances? That way the series 
>>>> cardinality can be divided up amongst multiple InfluxDB processes. If you 
>>>> need to query across cars this isn't as useful, but if each car is a 
>>>> separate consumer it seems reasonable.
>>>>
>>>>
>>>> Basically, that is what I'm suggesting in 3. 
>>>>
>>>
>>> No, the series cardinality concerns are per instance, not per database 
>>> or per measurement. The entire InfluxDB server shares series cardinality. 
>>> The only way to distribute the cardinality load is to put the series on 
>>> different InfluxDB servers, whether different nodes in a cluster or 
>>> different instances entirely.
>>>  
>>>
>>>>
>>>>  
>>>>
>>>>> 3. Measurement / per car, no tags. This seems like the best approach 
>>>>> for me at the moment. Horizontal partitioning would be relatively trivial 
>>>>> using car id as shard key if the number of cars on the system increases. 
>>>>> Series cardinality would equal with car cardinality. Obviously in order 
>>>>> to 
>>>>> get only the data of a single trip (which is important) I'd have to rely 
>>>>> on 
>>>>> the meta data (the trip's start and end time) stored in the rdbms. That 
>>>>> is 
>>>>> true for 2. as well.
>>>>>
>>>>
>>>> Too many measurements, and with no tags you're losing 50%+ of the 
>>>> functionality of the query language.
>>>>
>>>>
>>>> I'd have as many measurements (and thereby series) as cars in the 
>>>> system. And not necessarily in one database.
>>>> I thought only the series cardinality counts, not the measurement 
>>>> cardinality. 
>>>>
>>>
>>> Measurements are containers of series, they do not exist independent of 
>>> the series they contain. Think of them as mini-databases within the 
>>> database. Every measurement has at least one series.
>>>  
>>>
>>>> What impacts series cardinality the most is the number of different tag 
>>>> values if I understand correctly.
>>>>
>>>
>>> The number of actual combinations of tag values. See 
>>> https://groups.google.com/d/msgid/influxdb/23486fd5-4f40-4ae7-9573-a09eb55e6683%40googlegroups.com
>>>  
>>> for more context.
>>>  
>>>
>>>> I am not sure how I could utilize tags at his point but maybe I'm 
>>>> missing something.
>>>>
>>>>
>>>> I'm open to new ideas as well. What do you think?
>>>>>
>>>>
>>>> Start with your queries, work backwards from there. What questions does 
>>>> the schema need to support?
>>>>  
>>>>
>>>>
>>>>
>>>> The most important one is: give me all the data that belongs to a 
>>>> single trip. (so I can make some calculations from them and generate other 
>>>> series).
>>>> for 1. this would be select * from trip_<trip_id>
>>>> for 2. select * from the_one_giant_measurement where car_id = 
>>>> <trips_car_id> and time > <trip_start> and time < <trip_end>
>>>> for 3. select * from car_<car_id> where time > <trip_start> and time < 
>>>> <trip_end>
>>>>
>>>
>>> I think you want a schema that uses trip_id as a tag, and then you shard 
>>> the trips across multiple instances when cardinality becomes a concern. 
>>>
>>> SELECT * FROM trips WHERE trip_id = '1234'
>>>
>>>  
>>>
>>>>
>>>>
>>>> Thanks for your time, Sean!
>>>>
>>>>
>>>>> Thanks
>>>>>
>>>>> Zoltan
>>>>>
>>>>> -- 
>>>>> Remember to include the InfluxDB version number with all issue reports
>>>>> --- You received this message because you are subscribed to the Google 
>>>>> Groups "InfluxDB" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/influxdb.
>>>>> To view this discussion on the web visit 
>>>>> <https://groups.google.com/d/msgid/influxdb/f5bb0779-f2ac-d9f3-7078-9709d6369714%40gmail.com>
>>>>> https://groups.google.com/d/msgid/influxdb/f5bb0779-f2ac-d9f3-7078-9709d6369714%40gmail.com
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> Sean Beckett 
>>>> Director of Support and Professional Services
>>>> InfluxDB
>>>> -- 
>>>> Remember to include the InfluxDB version number with all issue reports
>>>> --- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "InfluxDB" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/influxdb.
>>>> To view this discussion on the web visit 
>>>> <https://groups.google.com/d/msgid/influxdb/CALGqCvMrxAEUzCVPwuG12aaJkv6Y%3DRKL1pkz%3DMqrLO1NVszwDA%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>> https://groups.google.com/d/msgid/influxdb/CALGqCvMrxAEUzCVPwuG12aaJkv6Y%3DRKL1pkz%3DMqrLO1NVszwDA%40mail.gmail.com
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>>
>>>> -- 
>>>> Remember to include the InfluxDB version number with all issue reports
>>>> --- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "InfluxDB" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/influxdb.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/influxdb/d050ba12-d7b9-6f27-b084-fc2d979b9a78%40gmail.com
>>>>  
>>>> <https://groups.google.com/d/msgid/influxdb/d050ba12-d7b9-6f27-b084-fc2d979b9a78%40gmail.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>>
>>> -- 
>>> Sean Beckett
>>> Director of Support and Professional Services
>>> InfluxDB
>>>
>> -- 
>> Remember to include the InfluxDB version number with all issue reports
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "InfluxDB" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> To post to this group, send email to [email protected] 
>> <javascript:>.
>> Visit this group at https://groups.google.com/group/influxdb.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/influxdb/02e7516f-e3a4-44b1-a928-d974ce66e2b8%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/influxdb/02e7516f-e3a4-44b1-a928-d974ce66e2b8%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> -- 
> Sean Beckett
> Director of Support and Professional Services
> InfluxDB
>

-- 
Remember to include the InfluxDB version number with all issue reports
--- 
You received this message because you are subscribed to the Google Groups 
"InfluxDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/influxdb.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/influxdb/14fbb3e3-1c3b-4f12-8416-6d7653ab984e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [influxdb] schema design advice

Reply via email to