Re: [influxdb] schema design advice

Sean Beckett Tue, 21 Jun 2016 14:48:13 -0700

Mathias,

You are of course correct that it is possible to imagine joining any given
measurement. I should not have made absolute statements. You asked why we
have separate measurements and I explained that they are logical containers
to assist in organization.


We are also keenly aware that many users want to JOIN across measurements.
It is a challenging feature to say the least, and not on the short term
roadmap. Many thousands of users are getting value from InfluxDB without
JOINs, so clearly it's not a crippling decision to architect the system
this way for the time being.

On Tue, Jun 21, 2016 at 2:57 PM, Mathias Herberts <
[email protected]> wrote:

> You could want to correlate cq stats with other write stats, so somehow
> joining them may make perfect sense.
>
> You should never consider that two measurements will never be joined, this
> is othogonal to what data science really is, i.e. exploration in many
> different ways to discover patterns or trends.
>
> On Tuesday, June 21, 2016 at 9:47:37 PM UTC+2, Sean Beckett wrote:
>>
>> Measurements are a way to segregate related field sets. You can put
>> everything into one measurement, but it makes things more confusing. You
>> can also put everything in one database, but why do that if they are
>> unrelated.
>>
>> Look at the _internal database for an example of why separate
>> measurements makes sense. Why store "write" stats in the same place as "cq"
>> stats, or alongside "runtime" information? They aren't related, you'd never
>> JOIN them, so might as well keep them logically separate for clarity.
>>
>> On Tue, Jun 21, 2016 at 1:43 PM, Mathias Herberts <[email protected]>
>> wrote:
>>
>>> One thing I really don't understand is why you have the notion of
>>> measurement in the first place. Since fields from different measurements
>>> cannot be joined, it would be safer to only have a single measurement, just
>>> in case any join should be done in the future.
>>>
>>>
>>> On Tuesday, June 21, 2016 at 6:20:59 PM UTC+2, Sean Beckett wrote:
>>>>
>>>>
>>>>
>>>> On Mon, Jun 20, 2016 at 5:56 PM, Zoltan Szalai <[email protected]>
>>>> wrote:
>>>>
>>>>> On 2016.06.21. 1:22, Sean Beckett wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jun 20, 2016 at 7:25 AM, Zoltan Szalai <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I would like to ask schema design advise for the following use case:
>>>>>>
>>>>>> I want to collect time series data of trips of cars.
>>>>>> I collect about 20-30 different metrics per trip (like speed, rpm,
>>>>>> location, temperature etc.) with different frequencies ranging from 3
>>>>>> seconds to 1 minute.
>>>>>> A car generates about 150-300 trips in a month, 2000-3000 trip a year.
>>>>>>
>>>>>> I have some data about the trips stored in an rdbms like trip_id,
>>>>>> car, driver, start datetime, end datetime and many more and I want to 
>>>>>> store
>>>>>> only the raw, collected time series data in influxdb. Among the raw data
>>>>>> collected from the cars I have to calculate additional time series data
>>>>>> derived from the collected data, like fuel consumption and also single
>>>>>> values like distance, average speed etc. typically when a trip ends.
>>>>>>
>>>>>> I had/have the following ideas in my mind:
>>>>>> 1. My first naive approach was to just simple store every trip's data
>>>>>> in its own measurement (trip_<trip_id>) without any tags but that would
>>>>>> generate a huge amount of series sooner or later. Since then I've learned
>>>>>> that series cardinality matters the most in terms of memory usage and
>>>>>> performance so it's obviously not a good choice. So any option that would
>>>>>> use the trip ids as tag values sounds like a really bad idea because I'll
>>>>>> have a lot of trips.
>>>>>>
>>>>>
>>>>> Encoding metadata in the measurement name is a bad practice. You
>>>>> should have dozens, maybe hundreds of measurements. More than that is a
>>>>> schema smell.
>>>>>
>>>>>
>>>>>
>>>>> Actually in this case, eventually I'd have as many series /
>>>>> measurements as trips which will be millions. So yeah, not a good idea.
>>>>>
>>>>
>>>> A series is in no way equivalent to a measurement. Each measurement is
>>>> a logical container of perhaps thousands or hundreds of thousands of
>>>> series. Each series in a measurement is joined by default for query
>>>> results. Series in two different measurements cannot be joined at all.
>>>>
>>>> You wouldn't have a measurement for each trip. I think you should have
>>>> only one measurement for all trips. You would have a series for each trip,
>>>> however.
>>>>
>>>> Total series count is what matters most for RAM usage.
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>> 2. Store all the data in a single measurement and use for example the
>>>>>> car id as a tag. That would generate giant amount of points in a single
>>>>>> measurement (is it a problem?) and would make horizontal partitioning of
>>>>>> the data difficult if I'd need to use multiple hosts / databases later.
>>>>>>
>>>>>
>>>>> 2-3k trips per year per car. Are you monitoring more than a few
>>>>> hundred cars? InfluxDB is fine up to about a million series, and then
>>>>> things get interesting.
>>>>>
>>>>>
>>>>> Not yet, but the number of cars will grow to thousands or perphaps
>>>>> tens of thousands (or more) in time. At least, that's the plan. :)
>>>>>
>>>>>
>>>>> Why not shard the cars into separate instances? That way the series
>>>>> cardinality can be divided up amongst multiple InfluxDB processes. If you
>>>>> need to query across cars this isn't as useful, but if each car is a
>>>>> separate consumer it seems reasonable.
>>>>>
>>>>>
>>>>> Basically, that is what I'm suggesting in 3.
>>>>>
>>>>
>>>> No, the series cardinality concerns are per instance, not per database
>>>> or per measurement. The entire InfluxDB server shares series cardinality.
>>>> The only way to distribute the cardinality load is to put the series on
>>>> different InfluxDB servers, whether different nodes in a cluster or
>>>> different instances entirely.
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>> 3. Measurement / per car, no tags. This seems like the best approach
>>>>>> for me at the moment. Horizontal partitioning would be relatively trivial
>>>>>> using car id as shard key if the number of cars on the system increases.
>>>>>> Series cardinality would equal with car cardinality. Obviously in order 
>>>>>> to
>>>>>> get only the data of a single trip (which is important) I'd have to rely 
>>>>>> on
>>>>>> the meta data (the trip's start and end time) stored in the rdbms. That 
>>>>>> is
>>>>>> true for 2. as well.
>>>>>>
>>>>>
>>>>> Too many measurements, and with no tags you're losing 50%+ of the
>>>>> functionality of the query language.
>>>>>
>>>>>
>>>>> I'd have as many measurements (and thereby series) as cars in the
>>>>> system. And not necessarily in one database.
>>>>> I thought only the series cardinality counts, not the measurement
>>>>> cardinality.
>>>>>
>>>>
>>>> Measurements are containers of series, they do not exist independent of
>>>> the series they contain. Think of them as mini-databases within the
>>>> database. Every measurement has at least one series.
>>>>
>>>>
>>>>> What impacts series cardinality the most is the number of different
>>>>> tag values if I understand correctly.
>>>>>
>>>>
>>>> The number of actual combinations of tag values. See
>>>> https://groups.google.com/d/msgid/influxdb/23486fd5-4f40-4ae7-9573-a09eb55e6683%40googlegroups.com
>>>> for more context.
>>>>
>>>>
>>>>> I am not sure how I could utilize tags at his point but maybe I'm
>>>>> missing something.
>>>>>
>>>>>
>>>>> I'm open to new ideas as well. What do you think?
>>>>>>
>>>>>
>>>>> Start with your queries, work backwards from there. What questions
>>>>> does the schema need to support?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> The most important one is: give me all the data that belongs to a
>>>>> single trip. (so I can make some calculations from them and generate other
>>>>> series).
>>>>> for 1. this would be select * from trip_<trip_id>
>>>>> for 2. select * from the_one_giant_measurement where car_id =
>>>>> <trips_car_id> and time > <trip_start> and time < <trip_end>
>>>>> for 3. select * from car_<car_id> where time > <trip_start> and time <
>>>>> <trip_end>
>>>>>
>>>>
>>>> I think you want a schema that uses trip_id as a tag, and then you
>>>> shard the trips across multiple instances when cardinality becomes a
>>>> concern.
>>>>
>>>> SELECT * FROM trips WHERE trip_id = '1234'
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> Thanks for your time, Sean!
>>>>>
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Zoltan
>>>>>>
>>>>>> --
>>>>>> Remember to include the InfluxDB version number with all issue reports
>>>>>> --- You received this message because you are subscribed to the
>>>>>> Google Groups "InfluxDB" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/influxdb.
>>>>>> To view this discussion on the web visit
>>>>>> <https://groups.google.com/d/msgid/influxdb/f5bb0779-f2ac-d9f3-7078-9709d6369714%40gmail.com>
>>>>>> https://groups.google.com/d/msgid/influxdb/f5bb0779-f2ac-d9f3-7078-9709d6369714%40gmail.com
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sean Beckett
>>>>> Director of Support and Professional Services
>>>>> InfluxDB
>>>>> --
>>>>> Remember to include the InfluxDB version number with all issue reports
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "InfluxDB" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/influxdb.
>>>>> To view this discussion on the web visit
>>>>> <https://groups.google.com/d/msgid/influxdb/CALGqCvMrxAEUzCVPwuG12aaJkv6Y%3DRKL1pkz%3DMqrLO1NVszwDA%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>>> https://groups.google.com/d/msgid/influxdb/CALGqCvMrxAEUzCVPwuG12aaJkv6Y%3DRKL1pkz%3DMqrLO1NVszwDA%40mail.gmail.com
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>>
>>>>> --
>>>>> Remember to include the InfluxDB version number with all issue reports
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "InfluxDB" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/influxdb.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/influxdb/d050ba12-d7b9-6f27-b084-fc2d979b9a78%40gmail.com
>>>>> <https://groups.google.com/d/msgid/influxdb/d050ba12-d7b9-6f27-b084-fc2d979b9a78%40gmail.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Sean Beckett
>>>> Director of Support and Professional Services
>>>> InfluxDB
>>>>
>>> --
>>> Remember to include the InfluxDB version number with all issue reports
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "InfluxDB" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/influxdb.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/influxdb/02e7516f-e3a4-44b1-a928-d974ce66e2b8%40googlegroups.com
>>> <https://groups.google.com/d/msgid/influxdb/02e7516f-e3a4-44b1-a928-d974ce66e2b8%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>>
>> --
>> Sean Beckett
>> Director of Support and Professional Services
>> InfluxDB
>>
> --
> Remember to include the InfluxDB version number with all issue reports
> ---
> You received this message because you are subscribed to the Google Groups
> "InfluxDB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/influxdb.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/influxdb/14fbb3e3-1c3b-4f12-8416-6d7653ab984e%40googlegroups.com
> <https://groups.google.com/d/msgid/influxdb/14fbb3e3-1c3b-4f12-8416-6d7653ab984e%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>



-- 
Sean Beckett
Director of Support and Professional Services
InfluxDB

-- 
Remember to include the InfluxDB version number with all issue reports
--- 
You received this message because you are subscribed to the Google Groups 
"InfluxDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/influxdb.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/influxdb/CALGqCvOTmNrQnrVHf5FZPDe0xOZ8vLSd3uZWJmgreY2xFuW60g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: [influxdb] schema design advice

Reply via email to