Hi,

This is not the current implementation: we do not have a partition folder layer on disk yet. With a partition folder, we would no longer need to keep all TsFileResources in memory, so the size of the device index would stop hurting us.
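To make the query path concrete, here is a minimal sketch of the partition lookup I have in mind. It is not existing IoTDB code: the class name `PartitionPruning` and the method `partitionsFor` are made up for illustration, and I assume one folder per day named yyyy-MM-dd and a query range given as whole days.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: names are hypothetical, not IoTDB's API. */
class PartitionPruning {

    /**
     * Returns the partition folders (assumed to be named yyyy-MM-dd, one per day)
     * whose day falls inside the closed query range [start, end].
     */
    static List<String> partitionsFor(List<String> partitionFolders,
                                      LocalDate start, LocalDate end) {
        List<String> hits = new ArrayList<>();
        for (String folder : partitionFolders) {
            // The folder name is the only "index" we consult here.
            LocalDate day = LocalDate.parse(folder);
            if (!day.isBefore(start) && !day.isAfter(end)) {
                hits.add(folder);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The in-memory List<String> of partition folders under root.sg.
        List<String> folders = List.of("2020-07-19", "2020-07-20", "2020-07-21");
        // select * from root.sg where time >= 2020-07-20 and time <= 2020-07-21
        System.out.println(partitionsFor(folders,
                LocalDate.parse("2020-07-20"), LocalDate.parse("2020-07-21")));
        // prints [2020-07-20, 2020-07-21]
    }
}
```

Only the folders returned here would have to be traversed to load their TsFileResources, so the per-TsFile device index is needed only for the partitions a query actually touches.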
Thanks,
--
Jialin Qiao
School of Software, Tsinghua University

乔嘉林
清华大学 软件学院

> -----Original Message-----
> From: "Xiangdong Huang" <saint...@gmail.com>
> Sent: 2020-07-21 18:46:31 (Tuesday)
> To: dev <dev@iotdb.apache.org>
> Cc:
> Subject: Re: [Discuss] How to delivery the device concept to users
>
> Hi Jialin,
>
> Yes, it is the current logic. But I do not see how what you said relates to this discussion...
>
> Best,
> -----------------------------------
> Xiangdong Huang
> School of Software, Tsinghua University
>
> 黄向东
> 清华大学 软件学院
>
>
> Jialin Qiao <qj...@mails.tsinghua.edu.cn> wrote on Tue, Jul 21, 2020 at 4:47 PM:
>
> > Hi,
> >
> > I would like to share a vision for managing the data files according to time partition.
> >
> > Since we introduced the time partition (data is partitioned by time interval), we do split the data in memory and across different TsFiles. But we may lack a partition folder layer on top of the TsFiles.
> >
> > Maybe it should work as follows:
> >
> > E.g., we insert data into storage group root.sg from 2020-07-19 to 2020-07-21 and the partition interval is 1 day.
> > First, we create three folders (2020-07-19, 2020-07-20, 2020-07-21) under root.sg, one for each partition.
> > Then, we store each TsFile in its related partition folder.
> >
> > An example of TsFiles on disk is as follows:
> >
> > sequence
> > ├── root.sg
> > │   ├── 2020-07-19
> > │   │   └── timestamp1-version1-merge.tsfile
> > │   │   └── timestamp1-version1-merge.tsfile.resource
> > │   │   └── ...
> > │   ├── 2020-07-20
> > │   ├── 2020-07-21
> >
> >
> > unsequence (similar to the sequence folder)
> > ├── root.sg
> > │   ├── 2020-07-19
> > │   ├── 2020-07-20
> > │   ├── 2020-07-21
> >
> > We only need to keep the list of partition folders in memory as a List<String>; this memory consumption is negligible.
> >
> > For the hot partitions, e.g., the partitions of the most recent 10 days, we could cache their TsFileResources in memory to accelerate queries.
> >
> > Then, how to do a query?
> >
> > Suppose we receive a query: select * from root.sg where time >= 2020-07-20 and time <= 2020-07-21
> >
> > - We could locate the two partitions under root.sg that may contain the results: 2020-07-20, 2020-07-21
> > - Then we traverse each of these partition folders to get all TsFileResources in that partition.
> > - Finally, we run the query.
> >
> > Is this feasible?
> >
> > Thanks,
> > --
> > Jialin Qiao
> > School of Software, Tsinghua University
> >
> > 乔嘉林
> > 清华大学 软件学院
> >
> > > -----Original Message-----
> > > From: "Xiangdong Huang" <saint...@gmail.com>
> > > Sent: 2020-07-20 20:03:34 (Monday)
> > > To: dev <dev@iotdb.apache.org>
> > > Cc:
> > > Subject: Re: Re: [Discuss] How to delivery the device concept to users
> > >
> > > Hi,
> > >
> > > > I wonder whether we could index the file by its name. (naming the tsfile by date)
> > >
> > > I think it is a good idea, but maybe not very easy to implement. If we can organize the data like this, then it is very regular and very easy to access or delete expired data...
> > >
> > > > we would need is a tree structure where each node has start time / end time for "everything" in the file.
> > >
> > > This is also a good idea.
> > >
> > > When we discuss the granularity of "device", what we are actually worrying about is the size of the index.
> > > So, we do not care whether there is a so-called "sub device"; we just care how many entities will be indexed.
> > >
> > > Suppose an IoTDB instance can bear 1 million index entries <some_id -> (start time, end time)>, and given a tree schema, if there are about 1 million nodes from level 0 to level 3, then we can index the nodes on level 3 (so level 3 is the so-called "device" in the current version).
> > >
> > > Meanwhile, indexing the nodes from level 0 to level 2, as Julian proposed, is also beneficial.
> > >
> > > The nature of the above idea is to let IoTDB decide which nodes are "devices" automatically.
> > >
> > > At the beginning of this discussion, I just wanted to let users declare which nodes are "devices" (or, which prefixes of Paths have time indexes.. but this kind of description may not be user friendly..). That is easier, but may carry risk if the user sets too many devices.
> > >
> > > Best,
> > > -----------------------------------
> > > Xiangdong Huang
> > > School of Software, Tsinghua University
> > >
> > > 黄向东
> > > 清华大学 软件学院
> > >
> > >
> > > runhus...@foxmail.com <runhus...@foxmail.com> wrote on Mon, Jul 20, 2020 at 7:47 PM:
> > >
> > > > Hi,
> > > >
> > > > > I wonder whether we could index the file by its name. (naming the tsfile by date) E.g., we store each day's data in one file and name it as sg-2020-07-20.TsFile. Then, we do not need to maintain the index in memory, we just need to check whether the file exists in the queried interval.
> > > >
> > > > So, how do we deal with out-of-order data? Could you give more details?
> > > >
> > > > Thanks!
> > > >
> > > > runhus...@foxmail.com
> > > >
> > > >
> > > > From: Jialin Qiao
> > > > Date: 2020-07-20 18:21
> > > > To: dev
> > > > Subject: Re: [Discuss] How to delivery the device concept to users
> > > > Hi,
> > > >
> > > > > The question I would ask is why "devices" hurt us.
> > > >
> > > > I'd like to explain this a bit. For each storage group, we flush the memtable into TsFiles one by one. For each TsFile, we maintain a temporal index at device level in memory. Suppose there are 3 devices in one TsFile; the index is like this:
> > > >
> > > > start time array: long[3] = {1, 1, 2}
> > > > end time array: long[3] = {5, 6, 10}
> > > > devicesToIndexInArray: Map<String, Integer> = {"root.sg.d1" -> 0, "root.sg.d2" -> 1, "root.sg.d3" -> 2}
> > > >
> > > > If we have millions of devices, for each TsFile, this index will reach dozens of MB in memory.
> > > > Although we could persist the index, it is still recommended to decrease the number of devices.
> > > >
> > > > I wonder whether we could index the file by its name (naming the tsfile by date). E.g., we store each day's data in one file and name it sg-2020-07-20.TsFile. Then, we do not need to maintain the index in memory; we just need to check whether the file exists in the queried interval.
> > > >
> > > > Thanks,
> > > > --
> > > > Jialin Qiao
> > > > School of Software, Tsinghua University
> > > >
> > > > 乔嘉林
> > > > 清华大学 软件学院
> > > >
> > > > > -----Original Message-----
> > > > > From: "Julian Feinauer" <j.feina...@pragmaticminds.de>
> > > > > Sent: 2020-07-20 17:34:40 (Monday)
> > > > > To: "dev@iotdb.apache.org" <dev@iotdb.apache.org>
> > > > > Cc:
> > > > > Subject: Re: [Discuss] How to delivery the device concept to users
> > > > >
> > > > > Hey Jialin, Xiangdong,
> > > > >
> > > > > very good question!
> > > > >
> > > > > And I tend to agree with Xiangdong.
> > > > > If the users do it that way it probably makes most sense for them.
> > > > > The question I would ask is why "devices" hurt us (I know a bit about the implementation of course, but probably we have to adapt our data model a bit in the future as well).
> > > > >
> > > > > Generally speaking, for me it also makes sense to be allowed to have "subcategories" below my devices, as my devices usually are "big".
> > > > > And technically speaking, in the current version it is totally possible to have nested structures below devices or measurements (but these will then again be devices).
> > > > >
> > > > > So my question is:
> > > > > - Do we really need the static construct of a "device", or can we probably use a different data structure where I "select" my device only at query time and we just select everything under that tree as its measurements or "sub-measurements" in cases of nesting?
> > > > >
> > > > > WDYT?
> > > > >
> > > > > Julian
> > > > >
> > > > > On 20.07.20, 09:34, "Xiangdong Huang" <saint...@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > This is quite a good topic!
> > > > >
> > > > > 1. Maybe we should hear more users' opinions.
> > > > >
> > > > > For me, I think emphasizing the concept of "device" is good. We can even expose the concept in our APIs.
> > > > >
> > > > > 2.
> > > > >
> > > > > > A more efficient way is
> > > > > > root.sg.device1.measurement1_int0
> > > > > > root.sg.device1.measurement1_int1
> > > > > > root.sg.device1.measurement1_int2
> > > > > > root.sg.device1.measurement2_long
> > > > >
> > > > > I think the more efficient way is:
> > > > >
> > > > > root.sg.device1.measurement1.0
> > > > > root.sg.device1.measurement1.1
> > > > > root.sg.device1.measurement1.2
> > > > > root.sg.device1.measurement2
> > > > >
> > > > > And, as you said "a device has a sensor that collects some data in array format (int[3]) and some in long type",
> > > > > will the user query just one element from the int[3]?
> > > > > If not, a better schema is:
> > > > >
> > > > > root.sg.device1.measurement1 (the dataType is int[])
> > > > > root.sg.device1.measurement2 (the dataType is long)
> > > > >
> > > > > Best,
> > > > > -----------------------------------
> > > > > Xiangdong Huang
> > > > > School of Software, Tsinghua University
> > > > >
> > > > > 黄向东
> > > > > 清华大学 软件学院
> > > > >
> > > > >
> > > > > Jialin Qiao <qj...@mails.tsinghua.edu.cn> wrote on Mon, Jul 20, 2020 at 3:28 PM:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > > Recently, I found that some users create timeseries that do not follow the real-world semantics of a device.
> > > > > >
> > > > > > E.g., a device has a sensor that collects some data in array format (int[3]) and some in long type.
> > > > > >
> > > > > > Many users will create timeseries like this:
> > > > > >
> > > > > > root.sg.device1.measurement1.int0
> > > > > > root.sg.device1.measurement1.int1
> > > > > > root.sg.device1.measurement1.int2
> > > > > > root.sg.device1.measurement2.long
> > > > > >
> > > > > > As a consequence, there will be two devices (root.sg.device1.measurement1 and root.sg.device1.measurement2) instead of one. This makes the number of devices in IoTDB much bigger than the number of real devices the users have in mind. The drawback is: more devices lead to more memory consumption.
> > > > > >
> > > > > > A more efficient way is
> > > > > >
> > > > > > root.sg.device1.measurement1_int0
> > > > > > root.sg.device1.measurement1_int1
> > > > > > root.sg.device1.measurement1_int2
> > > > > > root.sg.device1.measurement2_long
> > > > > >
> > > > > > In this schema, there will be only one device and 4 measurements.
> > > > > >
> > > > > > The problem is that we extract the device id automatically. Users usually do not have a clear concept of "device".
> > > > > > Should we emphasize the concept of device by letting users create devices manually?
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Thanks,
> > > > > > --
> > > > > > Jialin Qiao
> > > > > > School of Software, Tsinghua University
> > > > > >
> > > > > > 乔嘉林
> > > > > > 清华大学 软件学院