Re: [Discuss] How to delivery the device concept to users

Jialin Qiao Tue, 21 Jul 2020 05:37:41 -0700

Hi,

That is not the current implementation: we do not have a partition folder on
disk yet.
With a partition folder, there is no need to keep all TsFileResources in
memory, and the device index will no longer hurt us.
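
A minimal sketch of that idea, not IoTDB code: only partition names stay resident, and the resources of recently used partitions are kept in a small LRU cache. The `load_resources` callback is a hypothetical helper that would read the `.resource` files of one partition folder from disk, and the default capacity of 10 is only an assumption.

```python
from collections import OrderedDict
from typing import Callable, List


class PartitionResourceCache:
    """Keeps resources of at most `capacity` recently used partitions in memory."""

    def __init__(self, load_resources: Callable[[str], List[str]], capacity: int = 10):
        self._load = load_resources      # hypothetical loader: partition name -> resources
        self._capacity = capacity
        self._cache: "OrderedDict[str, List[str]]" = OrderedDict()

    def get(self, partition: str) -> List[str]:
        if partition in self._cache:
            self._cache.move_to_end(partition)   # mark as most recently used
            return self._cache[partition]
        resources = self._load(partition)        # cold partition: read from disk
        self._cache[partition] = resources
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)      # evict the least recently used partition
        return resources

    def cached_partitions(self) -> List[str]:
        return list(self._cache)
```

A query on an old partition then costs one folder scan, while memory stays bounded by the cache capacity.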


Thanks,
--
Jialin Qiao
School of Software, Tsinghua University

乔嘉林
清华大学 软件学院

> -----Original Message-----
> From: "Xiangdong Huang" <[email protected]>
> Sent: 2020-07-21 18:46:31 (Tuesday)
> To: dev <[email protected]>
> Cc:
> Subject: Re: [Discuss] How to delivery the device concept to users
> 
> Hi Jialin,
> 
> Yes, it is the current logic. But I do not see the relation between what you
> said and this discussion...
> 
> Best,
> -----------------------------------
> Xiangdong Huang
> School of Software, Tsinghua University
> 
>  黄向东
> 清华大学 软件学院
> 
> 
> Jialin Qiao <[email protected]> wrote on Tue, Jul 21, 2020 at 4:47 PM:
> 
> > Hi,
> >
> > I would like to share a vision for managing the data files by time
> > partition.
> >
> > Since we introduced time partitioning (data is partitioned by time
> > interval), we do split the data in memory and across different TsFiles.
> > But we may lack a partition folder layer on top of the TsFiles.
> >
> > Maybe it should work as follows:
> >
> > E.g., we insert data into storage group root.sg from 2020-07-19 to
> > 2020-07-21 and the partition interval is 1 day.
> > First, we create three folders (2020-07-19, 2020-07-20, 2020-07-21) under
> > root.sg, one for each partition.
> > Then, we store each TsFile in its partition folder.
> >
> > An example of TsFiles on disk is as follows:
> >
> > sequence
> > ├── root.sg
> > │   ├── 2020-07-19
> > │   │   ├── timestamp1-version1-merge.tsfile
> > │   │   ├── timestamp1-version1-merge.tsfile.resource
> > │   │   └── ...
> > │   ├── 2020-07-20
> > │   └── 2020-07-21
> >
> >
> > unsequence (similar to the sequence folder)
> > ├── root.sg
> > │   ├── 2020-07-19
> > │   ├── 2020-07-20
> > │   └── 2020-07-21
> >
> > We only need to keep the names of all partition folders in memory as a
> > List<String>; the memory consumption of this list is negligible.
> >
> > For hot partitions, e.g., the partitions of the most recent 10 days, we
> > could cache their TsFileResources in memory to accelerate queries.
> >
> > Then, how to do a query?
> >
> > Suppose we receive a query: select * from root.sg where time >=
> > 2020-07-20 and time <= 2020-07-21
> >
> > - We locate the two partitions under root.sg that may contain the
> > results: 2020-07-20, 2020-07-21
> > - Then we traverse each of these partition folders to get all
> > TsFileResources in the partition.
> > - Finally, we do queries.
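
The partition lookup in the first step can be sketched as follows, assuming folders are named with ISO dates (YYYY-MM-DD) and the partition interval is 1 day, as in the example above. The function name `partitions_for_query` is hypothetical. ISO date strings sort in calendar order, so a plain string comparison is enough.

```python
from datetime import date
from typing import List


def partitions_for_query(folders: List[str], start: date, end: date) -> List[str]:
    """Return the partition folders whose day falls inside [start, end]."""
    lo, hi = start.isoformat(), end.isoformat()
    # ISO dates compare lexicographically in the same order as the dates themselves.
    return [name for name in folders if lo <= name <= hi]


folders = ["2020-07-19", "2020-07-20", "2020-07-21"]
print(partitions_for_query(folders, date(2020, 7, 20), date(2020, 7, 21)))
# prints ['2020-07-20', '2020-07-21']
```

Only the folders returned here need their TsFileResources loaded; all other partitions are skipped without touching disk.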
> >
> > Is this feasible?
> >
> > Thanks,
> > --
> > Jialin Qiao
> > School of Software, Tsinghua University
> >
> > 乔嘉林
> > 清华大学 软件学院
> >
> > > -----Original Message-----
> > > From: "Xiangdong Huang" <[email protected]>
> > > Sent: 2020-07-20 20:03:34 (Monday)
> > > To: dev <[email protected]>
> > > Cc:
> > > Subject: Re: Re: [Discuss] How to delivery the device concept to users
> > >
> > > Hi,
> > >
> > > > I wonder whether we could index the file by its name. (naming the
> > > > tsfile by date)
> > >
> > > I think it is a good idea, but maybe not very easy to implement. If we
> > > can organize the data like this, then it is very regular and very easy
> > > to access or delete expired data...
> > >
> > > > we would need is a tree structure where each node has start time / end
> > > > time for "everything" in the file.
> > >
> > > This is also a good idea.
> > >
> > > When we discuss the granularity of "device", what we are actually
> > > worried about is the size of the index.
> > > So we do not care whether there is a so-called "sub device"; we only
> > > care how many entities will be indexed.
> > >
> > > Suppose an IoTDB instance can bear 1 million index entries <some_id ->
> > > (start time, end time)>, and given a tree schema, if there are about 1
> > > million nodes from level 0 to level 3, then we can index the nodes on
> > > level 3 (so level 3 is the so-called "device" in the current version).
> > >
> > > Meanwhile, indexing the nodes from level 0 to level 2, as Julian
> > > proposed, is also beneficial.
> > >
> > > The essence of the above idea is letting IoTDB decide automatically
> > > which nodes are "devices".
> > >
> > > At the beginning of this discussion, I just wanted to let users declare
> > > which nodes are "devices" (or, which prefixes of Paths have time indexes,
> > > though this kind of description may not be user friendly). That is
> > > easier, but it carries risk if a user declares too many devices.
> > >
> > > Best,
> > > -----------------------------------
> > > Xiangdong Huang
> > > School of Software, Tsinghua University
> > >
> > >  黄向东
> > > 清华大学 软件学院
> > >
> > >
> > > [email protected] <[email protected]> wrote on Mon, Jul 20, 2020 at 7:47 PM:
> > >
> > > > Hi,
> > > >
> > > > > I wonder whether we could index the file by its name. (naming the
> > > > > tsfile by date) E.g., we store each day's data in one file and name
> > > > > it as sg-2020-07-20.TsFile. Then, we do not need to maintain the
> > > > > index in memory, we just need to check whether the file exists in
> > > > > the queried interval.
> > > >
> > > > So, how do we deal with out-of-order data? Could you give more
> > > > details?
> > > >
> > > >
> > > >
> > > > Thanks!
> > > >
> > > > [email protected]
> > > >
> > > >
> > > > From: Jialin Qiao
> > > > Date: 2020-07-20 18:21
> > > > To: dev
> > > > Subject: Re: [Discuss] How to delivery the device concept to users
> > > > Hi,
> > > >
> > > > > The question I would ask is why "devices" hurt us.
> > > >
> > > > I'd like to introduce this a bit. For each storage group, we flush the
> > > > memtable into TsFiles one by one. For each TsFile, we maintain a
> > > > temporal index on device level in memory. Suppose there are 3 devices
> > > > in one TsFile; the index looks like this:
> > > >
> > > > start time array: long[3] = {1, 1, 2}
> > > > end time array: long[3] = {5, 6, 10}
> > > > devicesToIndexInArray: Map<String, Integer> = {"root.sg.d1" -> 0,
> > > > "root.sg.d2" -> 1, "root.sg.d3" -> 2}
> > > >
> > > > If we have millions of devices, this index will reach dozens of MB in
> > > > memory for each TsFile. Although we could persist the index, it is
> > > > still recommended to decrease the number of devices.
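
To make the size argument concrete, here is a rough sketch of the index layout above together with a lower-bound estimate. It is an illustration rather than the IoTDB implementation: the function names are hypothetical, and the estimate counts only the two arrays of 8-byte longs, so the map from device name to array position comes on top of it.

```python
from typing import Dict, List, Optional, Tuple

Index = Tuple[List[int], List[int], Dict[str, int]]


def build_device_index(ranges: Dict[str, Tuple[int, int]]) -> Index:
    """Build (start times, end times, device -> position) for one TsFile."""
    start_times: List[int] = []
    end_times: List[int] = []
    device_to_index: Dict[str, int] = {}
    for device, (start, end) in ranges.items():
        device_to_index[device] = len(start_times)
        start_times.append(start)
        end_times.append(end)
    return start_times, end_times, device_to_index


def may_contain(index: Index, device: str, query_start: int, query_end: int) -> bool:
    """True if the file may hold data of `device` inside [query_start, query_end]."""
    start_times, end_times, device_to_index = index
    pos: Optional[int] = device_to_index.get(device)
    if pos is None:
        return False
    return start_times[pos] <= query_end and end_times[pos] >= query_start


def array_bytes(num_devices: int) -> int:
    """Lower bound: two 8-byte longs per device, map overhead excluded."""
    return 2 * 8 * num_devices
```

With 1 million devices, `array_bytes` already gives 16,000,000 bytes per TsFile before counting the device-name keys, which is consistent with the "dozens of MB" figure.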
> > > >
> > > > I wonder whether we could index the file by its name (naming the
> > > > tsfile by date). E.g., we store each day's data in one file and name
> > > > it sg-2020-07-20.TsFile. Then we do not need to maintain the index in
> > > > memory; we just need to check whether a file exists in the queried
> > > > interval.
> > > >
> > > > Thanks,
> > > > --
> > > > Jialin Qiao
> > > > School of Software, Tsinghua University
> > > >
> > > > 乔嘉林
> > > > 清华大学 软件学院
> > > >
> > > > > -----Original Message-----
> > > > > From: "Julian Feinauer" <[email protected]>
> > > > > Sent: 2020-07-20 17:34:40 (Monday)
> > > > > To: "[email protected]" <[email protected]>
> > > > > Cc:
> > > > > Subject: Re: [Discuss] How to delivery the device concept to users
> > > > >
> > > > > Hey Jialin, Xiangdong,
> > > > >
> > > > > very good question!
> > > > >
> > > > > And I tend to agree with Xiangdong.
> > > > > If the users do it that way it probably makes most sense for them.
> > > > > The question I would ask is why "devices" hurt us (I know a bit about
> > > > > the implementation of course, but we probably have to adapt our data
> > > > > model a bit in the future as well).
> > > > >
> > > > > Generally speaking, for me it also makes sense to be allowed to have
> > > > > "subcategories" below my devices, as my devices usually are "big".
> > > > > And technically speaking, in the current version it is totally
> > > > > possible to have nested structures below devices or measurements (but
> > > > > these will then again be devices).
> > > > >
> > > > > So my question is:
> > > > > - Do we really need the static construct of a "device", or can we
> > > > > perhaps use a different data structure where I "select" my device only
> > > > > at query time and we just select everything under that tree as its
> > > > > measurements or "sub-measurements" in cases of nesting?
> > > > >
> > > > > WDYT?
> > > > >
> > > > > Julian
> > > > >
> > > > > On 20.07.20, 09:34, "Xiangdong Huang" <[email protected]> wrote:
> > > > >
> > > > >     Hi,
> > > > >
> > > > >     This is a quite good topic!
> > > > >
> > > > >     1. maybe we should hear more users' opinions.
> > > > >
> > > > >     For me, I think emphasizing the concept of "device" is good. We
> > > > >     can even expose the concept in our APIs.
> > > > >
> > > > >     2.
> > > > >
> > > > >     > A more efficient way is
> > > > >     > root.sg.device1.measurement1_int0
> > > > >     > root.sg.device1.measurement1_int1
> > > > >     > root.sg.device1.measurement1_int2
> > > > >     > root.sg.device1.measurement2_long
> > > > >
> > > > >     I think the more efficient way is:
> > > > >
> > > > >     root.sg.device1.measurement1.0
> > > > >     root.sg.device1.measurement1.1
> > > > >     root.sg.device1.measurement1.2
> > > > >     root.sg.device1.measurement2
> > > > >
> > > > >     And, as you said, "a device has a sensor that collects some data
> > > > >     in array format (int[3]) and some in long type":
> > > > >     will the user query just one element from the int[3]? If not, a
> > > > >     better schema is:
> > > > >
> > > > >     root.sg.device1.measurement1 (the dataType is int[])
> > > > >     root.sg.device1.measurement2 (the dataType is long)
> > > > >
> > > > >     Best,
> > > > >     -----------------------------------
> > > > >     Xiangdong Huang
> > > > >     School of Software, Tsinghua University
> > > > >
> > > > >      黄向东
> > > > >     清华大学 软件学院
> > > > >
> > > > >
> > > > >     Jialin Qiao <[email protected]> wrote on Mon, Jul 20, 2020 at 3:28 PM:
> > > > >
> > > > >     > Hi
> > > > >     >
> > > > >     > Recently, I found that some users create timeseries that do not
> > > > >     > follow the real-world semantics of a device.
> > > > >     >
> > > > >     >
> > > > >     > E.g., a device has a sensor that collects some data in array
> > > > >     > format (int[3]) and some in long type.
> > > > >     >
> > > > >     >
> > > > >     > Many users will create timeseries like this:
> > > > >     >
> > > > >     >
> > > > >     > root.sg.device1.measurement1.int0
> > > > >     > root.sg.device1.measurement1.int1
> > > > >     > root.sg.device1.measurement1.int2
> > > > >     > root.sg.device1.measurement2.long
> > > > >     >
> > > > >     >
> > > > >     > As a consequence, there will be two devices instead of one.
> > > > >     > The number of devices in the system becomes much bigger than the
> > > > >     > number of real devices the users have in mind. The drawback is
> > > > >     > that more devices lead to more memory consumption.
> > > > >     >
> > > > >     >
> > > > >     > A more efficient way is
> > > > >     >
> > > > >     >
> > > > >     > root.sg.device1.measurement1_int0
> > > > >     > root.sg.device1.measurement1_int1
> > > > >     > root.sg.device1.measurement1_int2
> > > > >     > root.sg.device1.measurement2_long
> > > > >     >
> > > > >     >
> > > > >     > In this schema, there will be only one device and 4
> > > > >     > measurements.
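
The difference between the two schemas can be checked with a small sketch. It assumes that the device id is everything before the last dot of a timeseries path, which is how the counts in this message work out. The helper names are hypothetical and this is not the IoTDB code.

```python
from typing import Iterable, Set


def device_of(path: str) -> str:
    """Device id = full path without its last node (assumed rule)."""
    return path.rsplit(".", 1)[0]


def devices(paths: Iterable[str]) -> Set[str]:
    """Distinct device ids derived from a set of timeseries paths."""
    return {device_of(p) for p in paths}


nested = [
    "root.sg.device1.measurement1.int0",
    "root.sg.device1.measurement1.int1",
    "root.sg.device1.measurement1.int2",
    "root.sg.device1.measurement2.long",
]
flat = [
    "root.sg.device1.measurement1_int0",
    "root.sg.device1.measurement1_int1",
    "root.sg.device1.measurement1_int2",
    "root.sg.device1.measurement2_long",
]
print(len(devices(nested)), len(devices(flat)))  # prints 2 1
```

The nested schema yields the devices root.sg.device1.measurement1 and root.sg.device1.measurement2, while the flat schema yields only root.sg.device1.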
> > > > >     >
> > > > >     >
> > > > >     > The problem is that we extract the device id automatically.
> > > > >     > Users usually do not have a clear concept of "device". Should
> > > > >     > we emphasize the concept of device by letting users create
> > > > >     > devices manually?
> > > > >     >
> > > > >     >
> > > > >     > What do you think?
> > > > >     >
> > > > >     > Thanks,
> > > > >     > --
> > > > >     > Jialin Qiao
> > > > >     > School of Software, Tsinghua University
> > > > >     >
> > > > >     > 乔嘉林
> > > > >     > 清华大学 软件学院
> > > > >
> > > >
> >
