On Fri, Sep 30, 2016 at 9:50 AM, Sean Fitts <[email protected]> wrote:
> Sean, hi.
>
> Thanks for the responses. If you have the time I have some follow-ups
> below...
>
> On Thursday, September 29, 2016 at 8:25:23 PM UTC-7, Sean Beckett wrote:
>>
>> For multi-tenant my first thought is each tenant gets their own database.
>> It does lead to significant series duplication, but it makes for performant
>> add and remove tenant operations. If the cardinality gets too high, some
>> databases can be backed up and restored into a new instance.
>>
>
> Do you have any experience with what a reasonable database cardinality is?
> Are we talking 100's, 1000's, 10,000's? Is the primary issue here going to
> be the number of open files?

I was actually referring to the series cardinality within the databases. The
series cardinality from ALL databases affects the instance. If the series
count starts getting too high, it would make sense to move the db with the
most series to a new instance.

As for how many databases to have, it's more a function of how many shard
groups are created. We have users with 1000+ databases without issue. It does
mean there are 1000+ paths in `/var/lib/influxdb/data/`, but modern OSes
don't have an issue there. With TSM there aren't very many files open, so
filehandles are not an issue like they were with 0.8 or 0.9.

If you have 1000 databases with 1 RP each having 50 year shard durations,
that's 1000 shard groups on disk per 50 years of operation. If you have 1
database with 5 RPs, each having 1 day shards, that's 1000 shard groups after
200 days, and it keeps growing from there. I don't have numbers on when that
starts to be a problem, but if you want to toss out your numbers we can
discuss them.

>>
>> Within the database, have a measurement per sub-system, unless of course
>> you want to enable queries across sub-systems. Otherwise store the
>> subsystem as a tag. That would require unique field names for each
>> subsystem.
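(A minimal sketch of the shard-group arithmetic above; `shard_groups` is a
hypothetical helper, and the figures are just the example numbers from this
thread, not recommendations.)

```python
# Shard groups accumulate roughly as:
#   databases x RPs x (elapsed_days / shard_duration_days)

def shard_groups(databases, rps_per_db, shard_duration_days, elapsed_days):
    """Approximate shard groups created across all databases and RPs."""
    groups_per_rp = elapsed_days / shard_duration_days
    return int(databases * rps_per_db * groups_per_rp)

# 1000 databases, 1 RP each, 50-year shards: 1000 groups per 50 years.
print(shard_groups(databases=1000, rps_per_db=1,
                   shard_duration_days=50 * 365, elapsed_days=50 * 365))  # 1000

# 1 database, 5 RPs, 1-day shards: 1000 groups after only 200 days.
print(shard_groups(databases=1, rps_per_db=5,
                   shard_duration_days=1, elapsed_days=200))  # 1000
```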
>> Field cardinality is not a significant concern, unless there are
>> only a few values per field, per shard.
>>
>
> When you say that storing subsystems as tags would require unique field
> names for each sub-system I'm not sure I understand why. If 2 subsystems
> share a particular metric (response time) couldn't there be one field for
> that tagged with the subsystem name? IIUC that would result in 2 series,
> one for each subsystem. Note that I'm not sure we'll do this because I
> currently don't see the need to aggregate data across sub-systems, but I
> want to make sure I understand how tags work.

There could be one field, yes, but imagine this, where `alice` and `bob` are
different subsystems:

metrics,system=alice,host=foo value=12
metrics,system=bob,host=foo value=200

If you query and specify the system tag, you will get the "value" you expect.
But if you ran `SELECT MEAN(value) FROM metrics` you would get back a
meaningless calculated number for "value". It's not that the database will
have issues, it's that users can get confused.
`SELECT MEAN(value) FROM metrics GROUP BY system` solves the problem, more
or less.

>>
>> You can review the Storage Engine
>> <http://docs.influxdata.com/influxdb/v1.0/concepts/storage_engine/#compression>
>> doc for more about field density in TSM files. Writing very sparse fields
>> is not recommended, but querying only a few fields per query is fine. Each
>> measurement + tagset + field is stored in its own series (columnar storage).
>>
>
> Thanks, that provides a good high level overview. I'm curious about the
> comment wrt writing sparse fields. Given that both the cache and the TSM
> files appear to treat each series as an independent entity, I wouldn't
> think it would matter how sparse either the fields or the points were
> (unless it is recording data for the "gaps").
> Sparse points might imply more points which I'm guessing could impact the
> WAL (which if I'm reading the doc correctly appears to be a log of the
> received points). Clearly I'm missing something.

The WAL doesn't care about sparse points, really. It's just a cache with a
special in-memory view to support queries until the points persist.

Sparse points means that there might be only 10 integer values in a series
in a shard. That's not very many and it won't encode or compress well. If
you have 1000 series with 10 points each in a shard, vs 10 series with 1000
points each, they have the same number of points, but the latter will be
much smaller on disk and faster to query.

Sparse fields are likely queried over longer time ranges, too. So if you
always needed 100 points from those series, then in the sparse example you'd
have to query across 10 shards every time. Better to raise the shard
duration 10-20x so that most queries hit only one shard, and never more than
two. Compression will be better, too.

> Thanks again for your help.
>
> Sean
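(A minimal sketch of the sparse-vs-dense trade-off above; `shards_touched`
is a hypothetical helper, and the 10-point/100-point figures are the
illustrative numbers from this thread.)

```python
# Same total point count, very different layout: sparse series spread a
# query across many shards, dense series keep it in one.

def shards_touched(points_needed, points_per_series_per_shard):
    """How many shards a query must read to collect N points from one series."""
    # Ceiling division: a partially-needed shard still has to be opened.
    return -(-points_needed // points_per_series_per_shard)

# Sparse layout: 10 points per series per shard -> 100-point query spans 10 shards.
print(shards_touched(points_needed=100, points_per_series_per_shard=10))   # 10

# Raise the shard duration 10x: 100 points per shard -> the query hits 1 shard.
print(shards_touched(points_needed=100, points_per_series_per_shard=100))  # 1
```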
--
Sean Beckett
Director of Support and Professional Services
InfluxDB

--
Remember to include the InfluxDB version number with all issue reports
---
You received this message because you are subscribed to the Google Groups
"InfluxDB" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/influxdb.
To view this discussion on the web visit
https://groups.google.com/d/msgid/influxdb/CALGqCvOGY7gs40m-QkQbJCbiCfzr3oKt2ytCjrJ8J8iP8%3D%2B4Bg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
