Re: Re: An easier way to create time series.

Xiangdong Huang Wed, 14 Aug 2019 20:52:48 -0700

Hi,

This is really an interesting topic.

I just want to list some use cases and the terms that different users use,
as far as I know.

Indeed different users use different terms to describe the same thing:

1. In many industrial cases, users will say that a "machine/device" has
some "sensors", while a sensor may generate several "conditions" (I mean
the state of something; I am not sure whether the English word is correct;
工况 in Chinese). For example, in an excavator management application,
these "conditions" are things like the speed, the mileage, the malfunction
state of some component, and so on.

2. Some users say a "machine/device" has some "variables" (变量 in Chinese).
For example, a subway train has more than 3000 variables now. A "variable"
here is the same as a "condition".

3. Some users say "metering data" (测点 in Chinese). For example, an
electricity meter has more than 100 "metering data" items, e.g., the A/B/C
voltages.

They do not say "source" or "data source". However, if you tell them that
their "machine/device" is a "source", they will agree.

In my understanding, a "source" or "data source" is a specific object that
can generate data.

Now let's review the concepts [1] of InfluxDB.

The schema of InfluxDB is: database - measurement - {tags}, {fields}.
There is also a logical concept, the series: "In InfluxDB, a series is the
collection of data that share a retention policy, measurement, and tag
set." [1]

- "Database" is for separating different applications, I think. IoTDB has
no such concept; we can introduce it at a proper time.

- "Measurement": we need to discuss it in detail.
Let's look at some examples. In [1], "census" is a measurement. In [2],
"CPU" (or "cpu_load_short") is a measurement.
There are many machines that have a CPU, e.g., "host=serverA,
region=us_west" and "host=serverB, region=us_east".
So, which one is the data source? Is it CPU?

In my opinion, CPU is just a kind of data source. "The CPU of serverA in
us_west" is an actual data source, i.e., the "device" in Tian Jiang's view
and the "series" in InfluxDB's view [1].

Then let's look at the fields. There are many metrics in a CPU (OK, I used
another term, "metrics"; in English I cannot tell how it differs from
"variable/condition"), e.g., the load utilization and the temperature.
That is to say, a physical CPU does not generate just one value at a time;
it can generate several values, and each one is a
variable/condition/measurement point (变量/工况/测点).

(As InfluxDB uses a columnar file format, it will store all "load
utilization" data that belongs to the same device together, and then store
all "temperature" data.)

According to the examples, I think a good practice in InfluxDB is this: if
a set of "series" (the InfluxDB concept) has the same "fields", they belong
to one measurement. Or, put differently, within a measurement the
identically named "fields" of different "series" have the same meaning and
should have the same data type (e.g., double, int, bool).
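
A minimal Python sketch of the practice described above, not InfluxDB's
actual implementation: points are grouped into series by measurement plus
tag set (the retention policy is left out for brevity), and identically
named fields are checked for a single data type per measurement. The point
values and the helper names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical points modelled on the "cpu" example; all values are made up.
points = [
    {"measurement": "cpu", "tags": {"host": "serverA", "region": "us_west"},
     "fields": {"load": 0.64, "temperature": 55.0}},
    {"measurement": "cpu", "tags": {"host": "serverB", "region": "us_east"},
     "fields": {"load": 0.30, "temperature": 48.5}},
    {"measurement": "cpu", "tags": {"host": "serverA", "region": "us_west"},
     "fields": {"load": 0.70, "temperature": 56.0}},
]


def series_key(point):
    """Identify a series by its measurement plus its full tag set."""
    return (point["measurement"], tuple(sorted(point["tags"].items())))


def group_by_series(points):
    """Collect the field sets of all points that share a series key."""
    series = defaultdict(list)
    for p in points:
        series[series_key(p)].append(p["fields"])
    return dict(series)


def field_types(points):
    """Collect the value types seen for each (measurement, field) pair."""
    types = defaultdict(set)
    for p in points:
        for name, value in p["fields"].items():
            types[(p["measurement"], name)].add(type(value).__name__)
    return dict(types)


series = group_by_series(points)
# True when every identically named field has exactly one type.
consistent = all(len(t) == 1 for t in field_types(points).values())
```

Here the three points fall into two series, one per host/region pair, and
both fields use one type throughout the measurement.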

Now let's look at the schema of IoTDB.

"Measurement": in IoTDB today, a "measurement" means a variable/condition.
It is similar to a "field" in InfluxDB.

Storage group: a storage group now plays two roles: (1) data in different
storage groups is stored in different files (in the future, we can also
introduce a replication factor and a retention/TTL policy per storage
group); (2) identically named "measurements" on different "devices" (in
IoTDB) that belong to the same storage group should have the same meaning
and data type (BTW, can this restriction be removed?).

"Series"/"Path": in IoTDB, a complete "path" is a "series", e.g.,
"root.cpu.serverA.us_west.load_utilization" or
"root.cpu.serverB.us_west.temperature". That is to say, "series" in IoTDB =
"series" in InfluxDB + "field" in InfluxDB.

"Device": IoTDB does not yet explicitly introduce the concept of a
"device". But when we wrote the code, we treated a "path" without the
"measurement" (i.e., the "field" in InfluxDB) as a "device" (or we can call
it a machine). In my opinion, a "device" is a real data source, while
something like "CPU" is just a kind of data source.
(By analogy, from a data analyst's view, MySQL is not a data source, but
username@MySQL_IP:port with a password is a valid data source.)
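
The relation between "path", "device" and "measurement" can be sketched in
a few lines of Python. This is only an illustration of the idea, not IoTDB
code; it assumes that node names contain no dots, and split_path is a
hypothetical helper.

```python
from collections import defaultdict


def split_path(path):
    """Split a full path into (device, measurement) at the last dot."""
    device, measurement = path.rsplit(".", 1)
    return device, measurement


# Example paths in the style used above; the first two share one device.
paths = [
    "root.cpu.serverA.us_west.load_utilization",
    "root.cpu.serverA.us_west.temperature",
    "root.cpu.serverB.us_west.temperature",
]

# Group the measurements by the device they belong to.
devices = defaultdict(list)
for p in paths:
    device, measurement = split_path(p)
    devices[device].append(measurement)
```

Three IoTDB series collapse into two devices here, which matches the view
that one InfluxDB series with two fields becomes two IoTDB series under the
same device.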

Device Template:
I think it is useful, because in the real world machines/devices are built
on product lines. A set of machines/devices that have the same hardware and
software can collect the same variables. Actually, when transferring data
from the machine to the data center, manufacturers always have their own
protocols, e.g., the first 4 bytes are the machine ID, the next 4 bytes are
the speed, and the next 1 byte says whether a malfunction has occurred.
Many devices/machines share the same protocol, and these machines should
have the same measurements (in IoTDB terms; fields in InfluxDB). So it is
good to define these measurements (e.g., their data type and encoding type)
just once.
That is what Tian Jiang wants to do, I think.
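
To make the protocol example concrete, here is a small Python sketch that
decodes such a 9-byte frame with the standard struct module. Only the field
widths come from the text above. The byte order (big-endian) and the
encodings (unsigned 32-bit ID, 32-bit float speed, 1-byte flag) are
assumptions, and decode_frame is a hypothetical helper.

```python
import struct

# Assumed layout: big-endian, uint32 machine ID, float32 speed, uint8 flag.
FRAME = struct.Struct(">IfB")


def decode_frame(payload: bytes) -> dict:
    """Decode one fixed-size frame into named measurements."""
    machine_id, speed, malfunction = FRAME.unpack(payload)
    return {
        "machine_id": machine_id,
        "speed": speed,
        "malfunction": bool(malfunction),
    }


# Build a sample frame and decode it again.
frame = FRAME.pack(42, 12.5, 1)
record = decode_frame(frame)
```

Every machine that speaks this protocol yields the same three measurements,
which is why defining their types once per template is attractive.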

Let's think further: is it true that devices of the same model always have
the same fields? The answer is no. For example, I have a Nikon 5200
camera. After using it for several months, I found it a bad experience that
there is no GPS info in my photos, so I bought a GPS accessory and attached
it to the camera. Now my Nikon 5200 has more "fields" than the others. So
we have to retain the ability to add a "field" to a single "device".
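
A rough Python sketch of how a template could be expanded while still
allowing extra fields on a single device. It only generates statement
strings in the CREATE TIMESERIES form used in the original proposal quoted
further down in this thread. The helper create_statements and the
"gps_latitude" measurement are hypothetical, and nothing here is an
existing IoTDB API.

```python
def create_statements(device_path, template, extra=None):
    """Expand a template (plus optional per-device extras) into statements."""
    measurements = dict(template)
    measurements.update(extra or {})
    return [
        f"CREATE TIMESERIES {device_path}.{name} "
        f"WITH DATATYPE={dtype},ENCODING={encoding}"
        for name, (dtype, encoding) in measurements.items()
    ]


# The "vehicle" template from the original proposal.
vehicle = {
    "speed": ("DOUBLE", "PLAIN"),
    "direction": ("DOUBLE", "PLAIN"),
    "temperature": ("DOUBLE", "PLAIN"),
    "fuel": ("DOUBLE", "PLAIN"),
}

plain = create_statements("root.sg1.vehicle1", vehicle)
# One device gets an additional, hypothetical measurement.
with_gps = create_statements(
    "root.sg1.vehicle2", vehicle, extra={"gps_latitude": ("DOUBLE", "PLAIN")}
)
```

The template stays unchanged; only the one device that received the
accessory ends up with the additional series.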

Tags vs Path.
Though they can achieve the same effect, and we can in fact convert a
tag-based schema to a path-based schema, I have to say that a tag-based
schema is more flexible and friendlier to users. A path relies on a
positional convention: the first node in a path is "root", the second node
is the value of tagkey1, the third node is the value of tagkey2, and so on.
So users do not have to write "host=" and "region=" again and again. The
inconvenience is that users have to know the meaning of each position in a
path.
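
The conversion between the two styles can be sketched in Python as follows.
The positional convention TAG_ORDER (region first, then host) is an
assumption chosen for this example, and both helpers are hypothetical.

```python
# Assumed convention: position 1 is the region, position 2 is the host.
TAG_ORDER = ["region", "host"]


def tags_to_path(tags, field, order=TAG_ORDER):
    """Turn a tag set plus a field name into a positional path."""
    missing = [key for key in order if key not in tags]
    if missing:
        raise KeyError(f"missing tag(s): {missing}")
    return ".".join(["root"] + [tags[key] for key in order] + [field])


def path_to_tags(path, order=TAG_ORDER):
    """Recover the tag set and field name; requires knowing the convention."""
    nodes = path.split(".")
    if nodes[0] != "root" or len(nodes) != len(order) + 2:
        raise ValueError(f"path does not match the convention: {path}")
    return dict(zip(order, nodes[1:-1])), nodes[-1]


path = tags_to_path({"host": "serverA", "region": "us_west"}, "temperature")
tags, field = path_to_tags(path)
```

The path is shorter to write, but path_to_tags only works if the reader
knows TAG_ORDER, which is exactly the inconvenience described above.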

In conclusion, I have just stated the facts about the differences between
IoTDB and InfluxDB.
It is true that a tag-based schema is more popular: in the industrial
domain, I think users are fine with either paths or tags, but in other
domains, especially IT, tags and fields are more popular.
What we should do is rethink whether a path-based schema is better. If so,
let's popularize it; if not, let's embrace the other one.

By the way, in the future I think IoTDB should support an array data type
for a measurement. Then we could define a series like "create timeseries
root.us_west.serverA.cpu with datatype=float_array[2]", where array[0]
refers to the utilization and array[1] refers to the temperature.

Best,

Ref:
[1] https://docs.influxdata.com/influxdb/v1.7/concepts/key_concepts/
[2] https://docs.influxdata.com/influxdb/v1.7/introduction/getting-started/

-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院

On Wed, 14 Aug 2019 at 6:18 PM, Tian Jiang <[email protected]> wrote:

> Hi Julian,
>
>
> Surely naming is important to users, but different users may have
> different opinions on naming. I think it is hard to satisfy everyone,
> so maybe we can hold a vote or something to discuss that later.
>
>
> My starting point is to provide a way to create a bunch of time series
> with fewer statements (as the title suggests), and this will not interfere
> with existing functionality. It is lightweight, and I can add this feature
> within a day or two.
>
>
> Adding tags is cool, and it can definitely enhance the expressive power of
> IoTDB, but the implementation may cause a lot of changes (and potential
> trouble) in the whole system, which is beyond this discussion. Since you
> seem interested, we may open another thread to discuss the tags (or
> whatever you want to call them) in detail.
>
>
> Tian Jiang
>
>
>
> At 2019-08-14 17:12:41, "Julian Feinauer" <[email protected]>
> wrote:
> >Hi Tian,
> >
> >I see naming as a minor issue to change but as a bigger issue for users
> >(nomen est omen...).
> >Regarding your other comment, I don't get what you mean.
> >
> >Think of situations like monitoring stuff from several machines of
> >multiple types in multiple plants.
> >Then I would like to say something like
> >
> >"do that for all series in plant A" or "in all series for machine type X".
> >
> >Indeed it's quite a "huge" change which has implications, but it would
> >rather "widen" the API to do "multi-series querying" than change it,
> >I guess?
> >
> >Julian
> >
> >On 14.08.19, 11:06, "Tian Jiang" <[email protected]> wrote:
> >
> >    The naming is not a big issue, but your schema proposals seem to be
> >    turning IoTDB into something else.
> >
> >
> >
> >
> >
> >    At 2019-08-14 16:55:33, "Jialin Qiao" <[email protected]> wrote:
> >    >Hi,
> >    >
> >    >I think source or datasource is good, and it's better to use, or at
> >    >least add, tags and fields, because many TSDBs use this conceptual
> >    >model.
> >    >
> >    >Some feasible ways to organize the schema, and their "select * from
> >    >the table" results:
> >    >
> >    >(1) Each type of datasource is a table, which has a time column,
> >    >some tag columns and some field columns.
> >    >
> >    >Table: sourceType
> >    >time, tag1, field1, field2
> >    >1, device1, 1, 1
> >    >2, device1, 2, 2
> >    >2, device2, 2, 2
> >    >
> >    >(2) Each datasource is a table with some tags. Each table has a time
> >    >column and some field columns. (The tags of one datasource are not
> >    >expected to change, so just treat them as metadata.)
> >    >
> >    >Table: source1(tag1=device1)
> >    >time, field1, field2
> >    >1, 1, 1
> >    >2, 2, 2
> >    >
> >    >Table: source2(tag1=device2)
> >    >time, field1, field2
> >    >2, 2, 2
> >    >
> >    >
> >    >Best,
> >    >--
> >    >Jialin Qiao
> >    >School of Software, Tsinghua University
> >    >
> >    >乔嘉林
> >    >清华大学 软件学院
> >    >
> >    >> -----Original Message-----
> >    >> From: "Julian Feinauer" <[email protected]>
> >    >> Sent: 2019-08-14 16:10:22 (Wednesday)
> >    >> To: "[email protected]" <[email protected]>
> >    >> Cc:
> >    >> Subject: Re: An easier way to create time series.
> >    >>
> >    >> Hi,
> >    >>
> >    >> let me chime in here as well.
> >    >> One of the things which was at first a bit "unfamiliar" for me was
> >    >> this device focus.
> >    >> It's a bit too "one-dimensional" from my perspective.
> >    >>
> >    >> Personally, I quite like how Influx does it: you have a name
> >    >> and can attach tags and fields to it.
> >    >> And even if we do not do it that way, I would prefer to name it a
> >    >> bit differently, such as "series" or "measurement" or "source".
> >    >> Device is a bit specific and just sounds odd from a user's
> >    >> perspective.
> >    >>
> >    >> I think it was good to keep it that way for 0.8.0.
> >    >> But for the next release we are open to break things a bit.
> >    >>
> >    >> What do others think?
> >    >>
> >    >> Julian
> >    >>
> >    >> On 14.08.19, 04:52, "Tian Jiang" <[email protected]> wrote:
> >    >>
> >    >>     Maybe, starting from syntactic sugar, we can add some improvements
> >    >>     gradually. Currently, I think making timeseries creation easier
> >    >>     should be enough. Please share if you have some fancy ideas that
> >    >>     can go with the introduction of "device".
> >    >>
> >    >>     Tian Jiang
> >    >>
> >    >>
> >    >>     At 2019-08-14 10:44:14, "Xiangdong Huang" <[email protected]> wrote:
> >    >>     >Hi,
> >    >>     >
> >    >>     >Looks fine to me.
> >    >>     >
> >    >>     >One question: is it just syntactic sugar, or can we improve the
> >    >>     >schema management as well? Any ideas?
> >    >>     >
> >    >>     >Best,
> >    >>     >-----------------------------------
> >    >>     >Xiangdong Huang
> >    >>     >School of Software, Tsinghua University
> >    >>     >
> >    >>     > 黄向东
> >    >>     >清华大学 软件学院
> >    >>     >
> >    >>     >
> >    >>     >On Wed, 14 Aug 2019 at 10:37 AM, Tian Jiang <[email protected]> wrote:
> >    >>     >
> >    >>     >> Greetings,
> >    >>     >>
> >    >>     >>
> >    >>     >> In the present version, it is a little troublesome to create a set
> >    >>     >> of timeseries that have the same measurements. On the other hand,
> >    >>     >> although we use the concept of "device" in the code, it is not
> >    >>     >> properly abstracted.
> >    >>     >>
> >    >>     >> Expected usage:
> >    >>     >>
> >    >>     >> Using IoTDB in a more relational way:
> >    >>     >>
> >    >>     >> CREATE DEVICE TEMPLATE vehicle (speed DOUBLE PLAIN, direction DOUBLE
> >    >>     >> PLAIN, temperature DOUBLE PLAIN, fuel DOUBLE PLAIN)
> >    >>     >>
> >    >>     >> If all datatypes (or encodings) are the same, you can write the
> >    >>     >> equivalent form:
> >    >>     >>
> >    >>     >> CREATE DEVICE TEMPLATE vehicle MEASUREMENTS (speed, direction,
> >    >>     >> temperature, fuel) DATATYPE DOUBLE ENCODING PLAIN
> >    >>     >>
> >    >>     >> Then you will be able to create time series in an easier way:
> >    >>     >>
> >    >>     >> CREATE DEVICE (vehicle) root.sg1.vehicle1
> >    >>     >>
> >    >>     >> Which is equivalent to:
> >    >>     >>
> >    >>     >> CREATE TIMESERIES root.sg1.vehicle1.speed WITH
> >    >>     >> DATATYPE=DOUBLE,ENCODING=PLAIN
> >    >>     >>
> >    >>     >> CREATE TIMESERIES root.sg1.vehicle1.direction WITH
> >    >>     >> DATATYPE=DOUBLE,ENCODING=PLAIN
> >    >>     >>
> >    >>     >> CREATE TIMESERIES root.sg1.vehicle1.fuel WITH
> >    >>     >> DATATYPE=DOUBLE,ENCODING=PLAIN
> >    >>     >>
> >    >>     >> CREATE TIMESERIES root.sg1.vehicle1.temperature WITH
> >    >>     >> DATATYPE=DOUBLE,ENCODING=PLAIN
> >    >>     >>
> >    >>     >> I hope this will narrow the gap between using IoTDB and traditional
> >    >>     >> relational databases.
> >    >>     >> Jira link:
> >    >>     >> https://issues.apache.org/jira/projects/IOTDB/issues/IOTDB-163?filter=allopenissues
> >    >>     >>
> >    >>     >>
> >    >>     >> Tian Jiang
> >    >>
> >    >>
> >
> >
>
