Re: [Discuss] How to delivery the device concept to users

Julian Feinauer Mon, 20 Jul 2020 03:54:02 -0700

Hi,

another idea I just had would be to store the complete tree (as shown below)
but only store a value explicitly if it differs between a node and its parent.
In many of the situations discussed we had devices with a static substructure,
which would mean that there is nearly no storage overhead.


In the example below we could omit the explicit values for root only:

    Root (*, *) 
            sg (1, 10)
                  d1 (1, 5)
                  d2 (1, 6)
                  d3 (2, 10)

But in other situations this approach could probably help to save a lot more?
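
A minimal sketch of one possible reading of this idea, assuming that a node shown as (*, *) stores nothing and derives its range from the span covering its children. The class `SparseTimeNode` and its methods are hypothetical and not part of IoTDB.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type for illustration; not an IoTDB class.
class SparseTimeNode {
    final String name;
    final Long start; // null means "not stored", shown as * in the example
    final Long end;   // null means "not stored"
    final List<SparseTimeNode> children = new ArrayList<>();

    SparseTimeNode(String name, Long start, Long end) {
        this.name = name;
        this.start = start;
        this.end = end;
    }

    SparseTimeNode add(SparseTimeNode child) {
        children.add(child);
        return this;
    }

    // Effective [start, end]: the explicit value if stored,
    // otherwise the span covering all children (assumed rule).
    long[] resolve() {
        if (start != null && end != null) {
            return new long[] {start, end};
        }
        if (children.isEmpty()) {
            throw new IllegalStateException("leaf " + name + " has no explicit range");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (SparseTimeNode c : children) {
            long[] r = c.resolve();
            min = Math.min(min, r[0]);
            max = Math.max(max, r[1]);
        }
        return new long[] {min, max};
    }

    public static void main(String[] args) {
        SparseTimeNode sg = new SparseTimeNode("sg", 1L, 10L)
            .add(new SparseTimeNode("d1", 1L, 5L))
            .add(new SparseTimeNode("d2", 1L, 6L))
            .add(new SparseTimeNode("d3", 2L, 10L));
        SparseTimeNode root = new SparseTimeNode("Root", null, null).add(sg);
        long[] r = root.resolve();
        System.out.println(r[0] + ", " + r[1]); // prints "1, 10"
    }
}
```

Under this reading the root costs no extra storage, and the saving grows with the number of nodes whose range can be derived.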

Julian


On 20.07.20, 12:21, "Julian Feinauer" <[email protected]> wrote:

    Thanks for the clear explanation. Yes, I remember that there were also
    performance issues reported with that.

    But to generalize the concept of a device, all we would need is a tree
    structure where each node has a start time / end time for "everything" in
    the file.
    Like in your example:

    Root (1, 10) 
            sg (1, 10)
                  d1 (1, 5)
                  d2 (1, 6)
                  d3 (2, 10)

    This would allow us to fetch the necessary information on each level.
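
A minimal sketch of such a tree, assuming dot-separated paths and one inclusive [start, end] range per node. `TimeRangeNode`, `child`, `find` and `overlaps` are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical tree node; every level carries its own time range.
class TimeRangeNode {
    final long start;
    final long end;
    final Map<String, TimeRangeNode> children = new LinkedHashMap<>();

    TimeRangeNode(long start, long end) {
        this.start = start;
        this.end = end;
    }

    // Create a child with its own range and return it.
    TimeRangeNode child(String name, long childStart, long childEnd) {
        TimeRangeNode n = new TimeRangeNode(childStart, childEnd);
        children.put(name, n);
        return n;
    }

    // Walk a dot-separated path below this node; null if any part is missing.
    TimeRangeNode find(String path) {
        TimeRangeNode cur = this;
        for (String part : path.split("\\.")) {
            cur = cur.children.get(part);
            if (cur == null) {
                return null;
            }
        }
        return cur;
    }

    // True if [queryStart, queryEnd] intersects this node's range.
    boolean overlaps(long queryStart, long queryEnd) {
        return start <= queryEnd && queryStart <= end;
    }

    public static void main(String[] args) {
        TimeRangeNode root = new TimeRangeNode(1, 10);
        TimeRangeNode sg = root.child("sg", 1, 10);
        sg.child("d1", 1, 5);
        sg.child("d2", 1, 6);
        sg.child("d3", 2, 10);
        System.out.println(root.find("sg.d1").overlaps(7, 9)); // false
        System.out.println(root.find("sg.d3").overlaps(7, 9)); // true
    }
}
```

Because a parent's range covers its children in the example, a query that does not overlap a node can skip the whole subtree below it.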

    What about using some kind of simple KV store, which is of course disk based
    but does its own in-memory caching and optimization, such that frequent
    accesses ("hot devices" or more generally "paths") are fast?

    Is this a valid idea?
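
To make the question concrete, here is a rough sketch of an LRU cache in front of a slower lookup, assuming the disk-based store can be represented by a plain function from path to [start, end]. `CachedRangeStore` and `diskLookup` are hypothetical; no particular KV store is implied.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical wrapper: keeps the most recently used paths in memory.
class CachedRangeStore {
    private final Function<String, long[]> diskLookup; // stand-in for the disk-based store
    private final Map<String, long[]> cache;
    int diskReads = 0; // counts how often the slow path is taken

    CachedRangeStore(final int capacity, Function<String, long[]> diskLookup) {
        this.diskLookup = diskLookup;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, long[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, long[]> eldest) {
                return size() > capacity; // evict once over capacity
            }
        };
    }

    // Returns {start, end} for the path, or null if the store has no entry.
    long[] get(String path) {
        long[] hit = cache.get(path);
        if (hit != null) {
            return hit;
        }
        diskReads++;
        long[] loaded = diskLookup.apply(path);
        if (loaded != null) {
            cache.put(path, loaded);
        }
        return loaded;
    }
}
```

Repeated lookups of a "hot" path such as `root.sg.d1` would then be served from memory, while rarely used paths fall out of the cache and are read again on demand.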

    Julian

    On 20.07.20, 11:53, "Jialin Qiao" <[email protected]> wrote:

        Hi,

        > The question I would ask is why "devices" hurt us.

        I'd like to introduce this a bit. For each storage group, we flush the
        memtable into TsFiles one by one. For each TsFile, we maintain a
        temporal index on device level in memory. Suppose there are 3 devices
        in one TsFile; the index then looks like this:

        start time array: long[3] = {1, 1, 2}
        end time array: long[3] = {5, 6, 10}
        devicesToIndexInArray: Map<String, Integer> = {"root.sg.d1" -> 0,
            "root.sg.d2" -> 1, "root.sg.d3" -> 2}

        If we have millions of devices, this index will reach dozens of MB in
        memory for each TsFile. Although we could persist the index, it is
        still recommended to decrease the number of devices.
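
The per-TsFile index described above could be sketched roughly as follows. The field layout mirrors the three structures in the example; the class name `DeviceTimeIndex` and the method `mayContain` are assumptions made for illustration and are not the actual IoTDB code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-TsFile index: two parallel arrays plus a device lookup map.
class DeviceTimeIndex {
    private final long[] startTimes;
    private final long[] endTimes;
    private final Map<String, Integer> deviceToIndex = new HashMap<>();

    DeviceTimeIndex(String[] devices, long[] startTimes, long[] endTimes) {
        this.startTimes = startTimes;
        this.endTimes = endTimes;
        for (int i = 0; i < devices.length; i++) {
            deviceToIndex.put(devices[i], i);
        }
    }

    // True if the file may hold data of the device within [queryStart, queryEnd].
    boolean mayContain(String device, long queryStart, long queryEnd) {
        Integer i = deviceToIndex.get(device);
        if (i == null) {
            return false; // device does not occur in this file
        }
        return startTimes[i] <= queryEnd && queryStart <= endTimes[i];
    }

    public static void main(String[] args) {
        DeviceTimeIndex index = new DeviceTimeIndex(
            new String[] {"root.sg.d1", "root.sg.d2", "root.sg.d3"},
            new long[] {1, 1, 2},
            new long[] {5, 6, 10});
        System.out.println(index.mayContain("root.sg.d1", 6, 8)); // false
        System.out.println(index.mayContain("root.sg.d3", 6, 8)); // true
    }
}
```

The two arrays alone cost 16 bytes per device; the map keys (full device paths as strings) come on top of that, which is where the size for millions of devices adds up.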

        I wonder whether we could index the file by its name (naming the
        TsFile by date). E.g., we store each day's data in one file and name it
        sg-2020-07-20.TsFile. Then we do not need to maintain the index in
        memory; we just need to check whether a file exists within the queried
        interval.
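
A small sketch of this naming idea, assuming one file per storage group and day with the pattern `<storageGroup>-<yyyy-MM-dd>.TsFile` taken from the example. `DailyFileSelector` and its methods are hypothetical helpers.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: derives candidate file names from a queried day range.
class DailyFileSelector {

    // LocalDate.toString() yields ISO-8601 yyyy-MM-dd, matching the example name.
    static String fileNameFor(String storageGroup, LocalDate day) {
        return storageGroup + "-" + day + ".TsFile";
    }

    // All file names that could hold data for the inclusive range [from, to].
    static List<String> candidateFiles(String storageGroup, LocalDate from, LocalDate to) {
        List<String> names = new ArrayList<>();
        for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
            names.add(fileNameFor(storageGroup, d));
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = candidateFiles(
            "sg", LocalDate.of(2020, 7, 19), LocalDate.of(2020, 7, 21));
        // prints [sg-2020-07-19.TsFile, sg-2020-07-20.TsFile, sg-2020-07-21.TsFile]
        System.out.println(names);
    }
}
```

A query would then only need to test which of these names exist on disk, without any per-device index in memory.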

        Thanks,
        --
        Jialin Qiao
        School of Software, Tsinghua University

        乔嘉林
        清华大学 软件学院

        > -----Original Message-----
        > From: "Julian Feinauer" <[email protected]>
        > Sent: 2020-07-20 17:34:40 (Monday)
        > To: "[email protected]" <[email protected]>
        > Cc: 
        > Subject: Re: [Discuss] How to delivery the device concept to users
        > 
        > Hey Jialin, Xiangdong,
        > 
        > very good question!
        > 
        > And I tend to agree with Xiangdong.
        > If the users do it that way it probably makes most sense for them.
        > The question I would ask is why "devices" hurt us (I know a bit about
        > the implementation of course, but we probably have to adapt our data
        > model a bit in the future as well).
        > 
        > Generally speaking, for me it also makes sense to be allowed to have
        > "subcategories" below my devices, as my devices usually are "big".
        > And technically speaking, in the current version it is totally
        > possible to have nested structures below devices or measurements (but
        > these will then again be devices).
        > 
        > So my question is:
        > - Do we really need the static construct of a "device", or could we
        > use a different data structure where I "select" my device only at query
        > time and we just select everything under that tree as its measurements
        > or "sub-measurements" in cases of nesting?
        > 
        > WDYT?
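
One way to picture "selecting the device at query time" is a plain prefix match over the full series paths. This is only an illustrative sketch; `QueryTimeDevice` and `measurementsUnder` are hypothetical names and do not describe how IoTDB resolves paths.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper: treats any path prefix as the "device" of a query.
class QueryTimeDevice {

    // Returns everything below devicePath, relative to it.
    static List<String> measurementsUnder(Collection<String> allSeries, String devicePath) {
        String prefix = devicePath + ".";
        List<String> result = new ArrayList<>();
        for (String series : allSeries) {
            if (series.startsWith(prefix)) {
                result.add(series.substring(prefix.length()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> series = Arrays.asList(
            "root.sg.device1.measurement1.int0",
            "root.sg.device1.measurement1.int1",
            "root.sg.device1.measurement1.int2",
            "root.sg.device1.measurement2.long");
        // prints [measurement1.int0, measurement1.int1, measurement1.int2, measurement2.long]
        System.out.println(measurementsUnder(series, "root.sg.device1"));
        // prints [int0, int1, int2]
        System.out.println(measurementsUnder(series, "root.sg.device1.measurement1"));
    }
}
```

With this view, the same stored paths can be read as one "big" device with nested sub-measurements or as a smaller device one level deeper, depending on the prefix chosen in the query.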
        > 
        > Julian
        > 
        > On 20.07.20, 09:34, "Xiangdong Huang" <[email protected]> wrote:
        > 
        >     Hi,
        > 
        >     This is a quite good topic!
        > 
        >     1. Maybe we should hear more users' opinions.
        > 
        >     For me, I think emphasizing the concept of "device" is good. We can
        >     even expose the concept in our APIs.
        > 
        >     2.
        > 
        >     > A more efficient way is
        >     > root.sg.device1.measurement1_int0
        >     > root.sg.device1.measurement1_int1
        >     > root.sg.device1.measurement1_int2
        >     > root.sg.device1.measurement2_long
        > 
        >     I think the more efficient way is:
        > 
        >     root.sg.device1.measurement1.0
        >     root.sg.device1.measurement1.1
        >     root.sg.device1.measurement1.2
        >     root.sg.device1.measurement2
        > 
        >     And, as you said, "a device has a sensor that collects some data
        >     in array format (int[3]) and some in long type".
        >     Will the user query just one element from the int[3]? If not, a
        >     better schema is:
        > 
        >     root.sg.device1.measurement1 (the dataType is int[])
        >     root.sg.device1.measurement2 (the dataType is long)
        > 
        >     Best,
        >     -----------------------------------
        >     Xiangdong Huang
        >     School of Software, Tsinghua University
        > 
        >      黄向东
        >     清华大学 软件学院
        > 
        > 
        >     On Mon, 20 Jul 2020 at 3:28 PM, Jialin Qiao <[email protected]> wrote:
        > 
        >     > Hi
        >     >
        >     > Recently, I found that some users create timeseries that do not
        >     > follow the real-world semantics of a device.
        >     >
        >     >
        >     > E.g., a device has a sensor that collects some data in array 
format
        >     > (int[3]) and some in long type.
        >     >
        >     >
        >     > Many users will create timeseries like this:
        >     >
        >     >
        >     > root.sg.device1.measurement1.int0
        >     > root.sg.device1.measurement1.int1
        >     > root.sg.device1.measurement1.int2
        >     > root.sg.device1.measurement2.long
        >     >
        >     >
        >     > As a consequence, there will be two devices instead of one. The
        >     > actual number of devices then becomes much bigger than the number
        >     > of real devices the users have in mind. The drawback is that more
        >     > devices lead to more memory consumption.
        >     >
        >     >
        >     > A more efficient way is
        >     >
        >     >
        >     > root.sg.device1.measurement1_int0
        >     > root.sg.device1.measurement1_int1
        >     > root.sg.device1.measurement1_int2
        >     > root.sg.device1.measurement2_long
        >     >
        >     >
        >     > In this schema, there will be only one device and 4 
measurements.
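
The difference between the two schemas can be illustrated with a short sketch, assuming the device id is simply the series path without its last segment. That rule is consistent with the examples in this mail but is stated here as an assumption; `DeviceExtraction` is a hypothetical class.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of automatic device id extraction.
class DeviceExtraction {

    // Assumed rule: device id = full path minus the last dot-separated segment.
    static String deviceOf(String seriesPath) {
        int lastDot = seriesPath.lastIndexOf('.');
        return lastDot < 0 ? seriesPath : seriesPath.substring(0, lastDot);
    }

    // Distinct device ids for a list of series paths, sorted for stable output.
    static Set<String> devicesOf(List<String> seriesPaths) {
        Set<String> devices = new TreeSet<>();
        for (String path : seriesPaths) {
            devices.add(deviceOf(path));
        }
        return devices;
    }

    public static void main(String[] args) {
        List<String> nested = Arrays.asList(
            "root.sg.device1.measurement1.int0",
            "root.sg.device1.measurement1.int1",
            "root.sg.device1.measurement1.int2",
            "root.sg.device1.measurement2.long");
        List<String> flat = Arrays.asList(
            "root.sg.device1.measurement1_int0",
            "root.sg.device1.measurement1_int1",
            "root.sg.device1.measurement1_int2",
            "root.sg.device1.measurement2_long");
        // prints [root.sg.device1.measurement1, root.sg.device1.measurement2]
        System.out.println(devicesOf(nested));
        // prints [root.sg.device1]
        System.out.println(devicesOf(flat));
    }
}
```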
        >     >
        >     >
        >     > The problem is that we extract the device id automatically. Users
        >     > usually do not have a clear concept of a "device". Should we
        >     > emphasize the concept of device by letting users create devices
        >     > manually?
        >     >
        >     >
        >     > What do you think?
        >     >
        >     > Thanks,
        >     > --
        >     > Jialin Qiao
        >     > School of Software, Tsinghua University
        >     >
        >     > 乔嘉林
        >     > 清华大学 软件学院
        >
