Re: [Discuss] How to delivery the device concept to users

Xiangdong Huang Tue, 21 Jul 2020 03:47:44 -0700

Hi Jialin,

Yes, that is the current logic. But I do not see how what you said relates
to this discussion...


Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院


Jialin Qiao <[email protected]> wrote on Tue, 21 Jul 2020 at 4:47 PM:

> Hi,
>
> I would like to share an idea for managing the data files according to
> time partition.
>
> After we introduced the time partition (data is partitioned by time
> interval), we already split the data in memory and across different
> TsFiles. But we may lack a partition folder layer on top of the TsFiles.
>
> Maybe it should work as follows:
>
> E.g., we insert data into storage group root.sg from 2020-07-19 to
> 2020-07-21 and the partition interval is 1 day.
> First, we create three folders (2020-07-19, 2020-07-20, 2020-07-21) under
> root.sg, one for each partition.
> Then, we store each TsFile in its partition folder.
>
> An example of TsFiles on disk is as follows:
>
> sequence
> ├── root.sg
> │   ├── 2020-07-19
> │   │   ├── timestamp1-version1-merge.tsfile
> │   │   ├── timestamp1-version1-merge.tsfile.resource
> │   │   └── ...
> │   ├── 2020-07-20
> │   └── 2020-07-21
>
>
> unsequence (similar to the sequence folder)
> ├── root.sg
> │   ├── 2020-07-19
> │   ├── 2020-07-20
> │   └── 2020-07-21
>
> We only need to keep the names of all partition folders in memory as a
> List<String>; the memory consumption of this list is negligible.
>
> For hot partitions, e.g., those of the most recent 10 days, we could cache
> their TsFileResources in memory to accelerate queries.
>
> Then, how do we run a query?
>
> Suppose we receive a query: select * from root.sg where time >=
> 2020-07-20 and time <= 2020-07-21
>
> - We could locate the two partitions under root.sg that may contain the
> results: 2020-07-20, 2020-07-21
> - Then we traverse each partition folder to get all TsFileResources in that
> partition.
> - Finally, we execute the query.
>
> Is this feasible?
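
The lookup step of this proposal can be sketched as follows. This is only a minimal illustration, assuming daily partitions whose folder names are ISO dates held in memory as a List<String>. The names PartitionLookup and locate are hypothetical and are not IoTDB APIs.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionLookup {
    // Hypothetical helper: returns the partition folders whose day falls
    // inside the closed query range [start, end].
    static List<String> locate(List<String> partitionFolders, LocalDate start, LocalDate end) {
        List<String> hits = new ArrayList<>();
        for (String name : partitionFolders) {
            // Assumption: each folder name is an ISO date such as 2020-07-20.
            LocalDate day = LocalDate.parse(name);
            if (!day.isBefore(start) && !day.isAfter(end)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> folders = Arrays.asList("2020-07-19", "2020-07-20", "2020-07-21");
        // Query: time >= 2020-07-20 and time <= 2020-07-21
        System.out.println(locate(folders, LocalDate.parse("2020-07-20"), LocalDate.parse("2020-07-21")));
    }
}
```

Only the folders returned here would then be traversed to collect their TsFileResources; that part is left out because it depends on IoTDB internals.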
>
> Thanks,
> --
> Jialin Qiao
> School of Software, Tsinghua University
>
> 乔嘉林
> 清华大学 软件学院
>
> > -----Original Message-----
> > From: "Xiangdong Huang" <[email protected]>
> > Sent: 2020-07-20 20:03:34 (Monday)
> > To: dev <[email protected]>
> > Cc:
> > Subject: Re: Re: [Discuss] How to delivery the device concept to users
> >
> > Hi,
> >
> > > I wonder whether we could index the file by its name. (naming the
> > > tsfile by date)
> >
> > I think it is a good idea, but maybe not very easy to implement. If we
> > can organize the data like this, then it is very regular and it is very
> > easy to access or delete expired data...
> >
> > > we would need is a tree structure where each node has start time / end
> > > time for "everything" in the file.
> >
> > This is also a good idea.
> >
> > When we discuss the granularity of "device", what we are actually worried
> > about is the size of the index.
> > So we do not care whether there is a so-called "sub device"; we just
> > care how many entities will be indexed.
> >
> > Suppose an IoTDB instance can hold 1 million index entries <some_id ->
> > (start time, end time)>, and given a tree schema, if there are about 1
> > million nodes from level 0 to level 3, then we can index the nodes on
> > level 3 (so level 3 is the so-called "device" in the current version).
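
One way to read this rule is sketched below: given how many schema-tree nodes sit on each level, pick the deepest level whose cumulative node count still fits the entry budget. This is an assumption about the intent, not existing IoTDB behaviour, and IndexLevelChooser, deepestIndexableLevel and the node counts are hypothetical.

```java
class IndexLevelChooser {
    // nodesPerLevel[i] is the number of schema-tree nodes on level i.
    // Returns the deepest level L such that levels 0..L together need at most
    // maxEntries index entries, or -1 if even level 0 does not fit.
    static int deepestIndexableLevel(long[] nodesPerLevel, long maxEntries) {
        long total = 0;
        int level = -1;
        for (int i = 0; i < nodesPerLevel.length; i++) {
            total += nodesPerLevel[i];
            if (total > maxEntries) {
                break; // indexing this level would exceed the budget
            }
            level = i;
        }
        return level;
    }

    public static void main(String[] args) {
        // Made-up counts: levels 0..3 hold about 0.9 million nodes in total.
        long[] nodesPerLevel = {1, 10, 1_000, 900_000, 50_000_000};
        System.out.println(deepestIndexableLevel(nodesPerLevel, 1_000_000));
    }
}
```

With these made-up counts the nodes on level 3 would play the role of "devices", while level 4 stays unindexed.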
> >
> > Meanwhile, indexing the nodes from level 0 to level 2, as Julian
> > proposed, is also beneficial.
> >
> > The essence of the above idea is letting IoTDB decide automatically
> > which nodes are "devices".
> >
> > At the beginning of this discussion, I just wanted to let users declare
> > which nodes are "devices" (or which path prefixes have time indexes,
> > though that description may not be user-friendly). That is easier, but
> > it carries a risk if the user sets too many devices.
> >
> > Best,
> > -----------------------------------
> > Xiangdong Huang
> > School of Software, Tsinghua University
> >
> >  黄向东
> > 清华大学 软件学院
> >
> >
> > [email protected] <[email protected]> wrote on Mon, 20 Jul 2020 at 7:47 PM:
> >
> > > Hi,
> > >
> > > > I wonder whether we could index the file by its name. (naming the
> > > > tsfile by date) E.g., we store each day's data in one file and name it
> > > > sg-2020-07-20.TsFile. Then, we do not need to maintain the index in
> > > > memory, we just need to check whether the file exists in the queried
> > > > interval.
> > >
> > > So, how would this deal with out-of-order data? Could you give more
> > > details?
> > >
> > >
> > >
> > > Thanks!
> > >
> > > [email protected]
> > >
> > >
> > > From: Jialin Qiao
> > > Date: 2020-07-20 18:21
> > > To: dev
> > > Subject: Re: [Discuss] How to delivery the device concept to users
> > > Hi,
> > >
> > > > The question I would ask is why "devices" hurt us.
> > >
> > > I'd like to explain this a bit. For each storage group, we flush the
> > > memtable into TsFiles one by one. For each TsFile, we maintain a temporal
> > > index at the device level in memory. Suppose there are 3 devices in one
> > > TsFile; the index looks like this:
> > >
> > > start time array: long[3] = {1, 1, 2}
> > > end time array: long[3] = {5, 6, 10}
> > > devicesToIndexInArray: Map<String, Integer> = {"root.sg.d1" -> 0,
> > > "root.sg.d2" -> 1, "root.sg.d3" -> 2}
> > >
> > > If we have millions of devices, this index will reach dozens of MB in
> > > memory for each TsFile. Although we could persist the index, it is still
> > > recommended to reduce the number of devices.
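
The per-TsFile index described above can be sketched like this, reusing the three example devices and their times. It is a simplified illustration under the assumption that the index is two long arrays plus a device-to-slot map. DeviceTimeIndex and mayContain are hypothetical names, not the actual IoTDB classes.

```java
import java.util.HashMap;
import java.util.Map;

class DeviceTimeIndex {
    private final long[] startTimes;
    private final long[] endTimes;
    private final Map<String, Integer> deviceToIndex = new HashMap<>();

    DeviceTimeIndex(String[] devices, long[] startTimes, long[] endTimes) {
        this.startTimes = startTimes;
        this.endTimes = endTimes;
        for (int i = 0; i < devices.length; i++) {
            deviceToIndex.put(devices[i], i); // slot i holds this device's times
        }
    }

    // True if the file may hold data of the device in the closed range [from, to].
    boolean mayContain(String device, long from, long to) {
        Integer i = deviceToIndex.get(device);
        return i != null && startTimes[i] <= to && endTimes[i] >= from;
    }

    // The three-device example from the message above.
    static DeviceTimeIndex example() {
        return new DeviceTimeIndex(
            new String[] {"root.sg.d1", "root.sg.d2", "root.sg.d3"},
            new long[] {1, 1, 2},
            new long[] {5, 6, 10});
    }

    public static void main(String[] args) {
        DeviceTimeIndex index = example();
        System.out.println(index.mayContain("root.sg.d1", 6, 10)); // d1 ends at 5
        System.out.println(index.mayContain("root.sg.d3", 6, 10)); // d3 spans 2..10
    }
}
```

Every device costs one map entry and two long slots in every TsFile, which is why the number of devices drives the memory footprint.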
> > >
> > > I wonder whether we could index the file by its name. (naming the
> > > tsfile by date) E.g., we store each day's data in one file and name it
> > > sg-2020-07-20.TsFile. Then, we do not need to maintain the index in
> > > memory, we just need to check whether the file exists in the queried
> > > interval.
> > >
> > > Thanks,
> > > --
> > > Jialin Qiao
> > > School of Software, Tsinghua University
> > >
> > > 乔嘉林
> > > 清华大学 软件学院
> > >
> > > > -----Original Message-----
> > > > From: "Julian Feinauer" <[email protected]>
> > > > Sent: 2020-07-20 17:34:40 (Monday)
> > > > To: "[email protected]" <[email protected]>
> > > > Cc:
> > > > Subject: Re: [Discuss] How to delivery the device concept to users
> > > >
> > > > Hey Jialin, Xiangdong,
> > > >
> > > > very good question!
> > > >
> > > > And I tend to agree with Xiangdong.
> > > > If the users do it that way it probably makes most sense for them.
> > > > The question I would ask is why "devices" hurt us (I know a bit about
> > > > the implementation of course, but we will probably have to adapt our
> > > > data model a bit in the future).
> > > >
> > > > Generally speaking, for me it also makes sense to be allowed to have
> > > > "subcategories" below my devices, as my devices usually are "big".
> > > > And technically speaking, in the current version it is totally possible
> > > > to have nested structures below devices or measurements (but these will
> > > > then again be devices).
> > > >
> > > > So my question is:
> > > > - Do we really need the static construct of a "device", or can we
> > > > perhaps use a different data structure where I "select" my device only
> > > > at query time and we just select everything under that tree as its
> > > > measurements or "sub-measurements" in cases of nesting?
> > > >
> > > > WDYT?
> > > >
> > > > Julian
> > > >
> > > > On 20.07.20, 09:34, "Xiangdong Huang" <[email protected]> wrote:
> > > >
> > > >     Hi,
> > > >
> > > >     This is a quite good topic!
> > > >
> > > >     1. Maybe we should hear more users' opinions.
> > > >
> > > >     For me, I think emphasizing the concept of "device" is good. We can
> > > >     even expose the concept in our APIs.
> > > >
> > > >     2.
> > > >
> > > >     > A more efficient way is
> > > >     > root.sg.device1.measurement1_int0
> > > >     > root.sg.device1.measurement1_int1
> > > >     > root.sg.device1.measurement1_int2
> > > >     > root.sg.device1.measurement2_long
> > > >
> > > >     I think the more efficient way is:
> > > >
> > > >     root.sg.device1.measurement1.0
> > > >     root.sg.device1.measurement1.1
> > > >     root.sg.device1.measurement1.2
> > > >     root.sg.device1.measurement2
> > > >
> > > >     And, as you said, "a device has a sensor that collects some data in
> > > >     array format (int[3]) and some in long type",
> > > >     will the user query just one element from the int[3]? If not, a
> > > >     better schema is:
> > > >
> > > >     root.sg.device1.measurement1 (the dataType is int[])
> > > >     root.sg.device1.measurement2 (the dataType is long)
> > > >
> > > >     Best,
> > > >     -----------------------------------
> > > >     Xiangdong Huang
> > > >     School of Software, Tsinghua University
> > > >
> > > >      黄向东
> > > >     清华大学 软件学院
> > > >
> > > >
> > > >     Jialin Qiao <[email protected]> wrote on Mon, 20 Jul 2020 at 3:28 PM:
> > > >
> > > >     > Hi
> > > >     >
> > > >     > Recently, I found that some users create timeseries that do not
> > > >     > follow the real-world semantics of a device.
> > > >     >
> > > >     >
> > > >     > E.g., a device has a sensor that collects some data in array
> > > >     > format (int[3]) and some in long type.
> > > >     >
> > > >     >
> > > >     > Many users will create timeseries like this:
> > > >     >
> > > >     >
> > > >     > root.sg.device1.measurement1.int0
> > > >     > root.sg.device1.measurement1.int1
> > > >     > root.sg.device1.measurement1.int2
> > > >     > root.sg.device1.measurement2.long
> > > >     >
> > > >     >
> > > >     > As a consequence, there will be two devices instead of one. This
> > > >     > makes the actual number of devices in IoTDB much larger than the
> > > >     > number of real devices the users have in mind. The drawback is
> > > >     > that more devices lead to more memory consumption.
> > > >     >
> > > >     >
> > > >     > A more efficient way is
> > > >     >
> > > >     >
> > > >     > root.sg.device1.measurement1_int0
> > > >     > root.sg.device1.measurement1_int1
> > > >     > root.sg.device1.measurement1_int2
> > > >     > root.sg.device1.measurement2_long
> > > >     >
> > > >     >
> > > >     > In this schema, there will be only one device and 4 measurements.
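
The difference between the two schemas can be checked with a small sketch. It assumes the device id is simply the timeseries path without its last node, which matches the device counts given above but is a simplification. DeviceCount, deviceOf and devices are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DeviceCount {
    // Assumed rule: the device id is everything before the last '.' of the path.
    static String deviceOf(String path) {
        return path.substring(0, path.lastIndexOf('.'));
    }

    // Collects the distinct device ids of a list of timeseries paths.
    static Set<String> devices(List<String> paths) {
        Set<String> ids = new TreeSet<>();
        for (String path : paths) {
            ids.add(deviceOf(path));
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> dotted = Arrays.asList(
            "root.sg.device1.measurement1.int0",
            "root.sg.device1.measurement1.int1",
            "root.sg.device1.measurement1.int2",
            "root.sg.device1.measurement2.long");
        List<String> underscored = Arrays.asList(
            "root.sg.device1.measurement1_int0",
            "root.sg.device1.measurement1_int1",
            "root.sg.device1.measurement1_int2",
            "root.sg.device1.measurement2_long");
        // prints [root.sg.device1.measurement1, root.sg.device1.measurement2]
        System.out.println(devices(dotted));
        // prints [root.sg.device1]
        System.out.println(devices(underscored));
    }
}
```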
> > > >     >
> > > >     >
> > > >     > The problem is that we extract the device id automatically. Users
> > > >     > usually do not have a clear concept of "device". Should we
> > > >     > emphasize the concept of device by letting users create devices
> > > >     > manually?
> > > >     >
> > > >     >
> > > >     > What do you think?
> > > >     >
> > > >     > Thanks,
> > > >     > --
> > > >     > Jialin Qiao
> > > >     > School of Software, Tsinghua University
> > > >     >
> > > >     > 乔嘉林
> > > >     > 清华大学 软件学院
> > > >
> > >
>
