Hi,

I have a suggestion: we could add a Statistics object to TimeseriesMetadata to support fast aggregations.

Thanks,
--
Jialin Qiao (乔嘉林)
School of Software, Tsinghua University

> -----Original Message-----
> From: "Jialin Qiao" <[email protected]>
> Sent: 2020-02-13 16:16:54 (Thursday)
> To: [email protected]
> Cc:
> Subject: Re: Suggestions for new TsFile
>
> Hi,
>
> +1, most queries contain a time filter.
>
> But I don't know what you mean by "add a time attribute". Add it to where?
>
> Thanks,
> --
> Jialin Qiao (乔嘉林)
> School of Software, Tsinghua University
>
> > -----Original Message-----
> > From: "Dawei Liu" <[email protected]>
> > Sent: 2020-02-13 15:55:48 (Thursday)
> > To: [email protected]
> > Cc:
> > Subject: Re: Suggestions for new TsFile
> >
> > Hi,
> >
> > I found another problem. When I execute `SELECT s1 FROM xx WHERE time = 1`
> > on the new TsFile, the hard drive has to be read 3 times:
> >
> > 1. Read TsFileMetaData.
> > 2. Read the metadata of all measurements of the device (TimeSeriesMetaData).
> > 3. Read the ChunkMetaData of the required measurement. Only then can the
> >    time filter (time = 1) decide which Chunks can be used.
> >
> > In the current server, most of the time is spent on the TimeFilter. We read a
> > lot of metadata, and if it turns out to be unusable in the end, that is a
> > very big loss.
> >
> > So I think we should add a time attribute, so that we already know whether
> > the file can be skipped the first time we read the hard drive.
> >
> > Thanks
> >
> > Dawei Liu
> >
> > > On Feb 12, 2020, at 8:03 PM, Dawei Liu <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > I see it. It looks very nice.
> > >
> > > Thanks
> > >
> > > Dawei Liu
> > >
> > >> On Feb 12, 2020, at 6:52 PM, Haonan Hou <[email protected]> wrote:
> > >>
> > >> Sure, already added.
> > >>
> > >> Thanks,
> > >>
> > >> Haonan Hou
> > >>
> > >>> On Feb 12, 2020, at 5:57 PM, Jialin Qiao <[email protected]> wrote:
> > >>>
> > >>> Hi Haonan,
> > >>>
> > >>> I cannot see the picture. Could you please put it in the PR?
> > >>>
> > >>> Thanks,
> > >>> --
> > >>> Jialin Qiao (乔嘉林)
> > >>> School of Software, Tsinghua University
> > >>>
> > >>> -----Original Message-----
> > >>> From: "Haonan Hou" <[email protected]>
> > >>> Sent: 2020-02-12 16:27:22 (Wednesday)
> > >>> To: "[email protected]" <[email protected]>
> > >>> Cc:
> > >>> Subject: Re: Suggestions for new TsFile
> > >>>
> > >>> Hi,
> > >>>
> > >>> We have a newer design of TsFile, which combines the suggestions from
> > >>> Jialin and Dawei.
> > >>>
> > >>> The main differences are as follows:
> > >>>
> > >>> 1. Remove TsOffsetArray.
> > >>> 2. Modify the device map in TsFileMetaData to store, for each device,
> > >>>    the start offset of its first TimeseriesMetadata and the total data
> > >>>    size of all its TimeseriesMetadata.
> > >>>
> > >>> The newer TsFile structure should look like this: [picture not displayed]
> > >>>
> > >>> Here is an example of how the new structure works.
> > >>>
> > >>> When we try to get the List<ChunkMetadata> of timeseries "d0.s1", we
> > >>> first deserialize the map in TsFileMetadata. This gives us the start
> > >>> offset of TimeseriesMetadata "s0", which is the first
> > >>> TimeseriesMetadata of "d0", and the data size of all
> > >>> TimeseriesMetadata in "d0".
> > >>>
> > >>> After that, we are able to deserialize all TimeseriesMetadata in "d0".
> > >>>
> > >>> Finally, we have the TimeseriesMetadata of "d0.s1" and can get the
> > >>> ChunkMetadata list of "d0.s1".
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Haonan Hou
> > >>>
> > >>> On Feb 11, 2020, at 8:08 PM, Jialin Qiao <[email protected]> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> If each device only stores the offset of each TimeseriesMetadata, like this:
> > >>> TsFileMetaData ---> [ {deviceId(d0), [0,1,2] }, {deviceId(d1), [3,4,5] }, … ]
> > >>>
> > >>> it could be simplified to recording the start offset and end offset:
> > >>> TsFileMetaData ---> [ {deviceId(d0), [0,2] }, {deviceId(d1), [3,5] }, … ]
> > >>>
> > >>> And finally, it could be replaced by:
> > >>> TsFileMetaData ---> [ {deviceId(d0), 0 }, {deviceId(d1), 3 }, … ]
> > >>>
> > >>> Thanks,
> > >>> --
> > >>> Jialin Qiao (乔嘉林)
> > >>> School of Software, Tsinghua University
> > >>>
> > >>> On Tue, Feb 11, 2020, at 7:59 PM, atoiLiu <[email protected]> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> Thank you for your reply.
> > >>> I am very happy that you can take my suggestion.
> > >>>
> > >>> Thanks
> > >>>
> > >>> Dawei Liu
> > >>>
> > >>> On Feb 11, 2020, at 6:04 PM, Haonan Hou <[email protected]> wrote:
> > >>>
> > >>> Hi Dawei,
> > >>>
> > >>> Thank you so much for sharing your opinion about the new TsFile!
> > >>> I am very happy to take your suggestions.
> > >>>
> > >>> You said we can remove TsOffsetArray and directly store the offsets of
> > >>> TimeseriesMetaData. I agree with you. It is better than my version.
> > >>> Besides, for the optimization of TimeseriesMetaData, I would like to
> > >>> discuss with other people to determine which way is better.
> > >>>
> > >>> Best,
> > >>>
> > >>> Haonan Hou
> > >>>
> > >>> On Feb 11, 2020, at 5:35 PM, atoiLiu <[email protected]> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> I'm learning the new TsFile in PR [1], but I think TsFileMetaData has
> > >>> a bad design.
> > >>>
> > >>> TsFileMetaData has a TsOffsetArray, which records the offset of every
> > >>> TimeseriesMetaData, and uses a Map<deviceId, int[]> to record the
> > >>> startIndex and endIndex into TsOffsetArray. It looks like:
> > >>>
> > >>> TsFileMetaData ---> { [0,1,2,3,4,5, …], [ {deviceId(d0), [0,2] }, {deviceId(d1), [3,5] }, … ] }
> > >>>
> > >>> We can delete TsOffsetArray and store the offsets directly in the
> > >>> deviceIndexArray. TsFileMetaData will then have a
> > >>> Map<deviceId, List<Long>> to record them. This change will save 4 bytes
> > >>> per device on disk, because each device only needs to record the number
> > >>> of offsets and the offsets themselves. It looks like:
> > >>>
> > >>> TsFileMetaData ---> [ {deviceId(d0), [0,1,2] }, {deviceId(d1), [3,4,5] }, … ]
> > >>>
> > >>> In addition, TimeSeriesMetaData is an ordered structure on the hard
> > >>> disk, and the TimeSeriesMetaData of each device are stored next to each
> > >>> other, so TsFileMetaData does not need to store all the offset
> > >>> information. There are two optimization directions:
> > >>>
> > >>> 1. Save startTime, endTime and offset for each TimeSeriesMetaData in
> > >>>    TsFileMetaData. The nice thing about this is that once
> > >>>    TsFileMetaData has been read from the hard drive, you can directly
> > >>>    apply a filter to skip the TimeSeriesMetaData that do not need to
> > >>>    be read.
> > >>>
> > >>> 2. Only save the offset of the first TimeSeriesMetaData in
> > >>>    TsFileMetaData, so that you can loop through them with a single
> > >>>    seek. It looks like:
> > >>>
> > >>>    TsFileMetaData ---> [ {deviceId(d0), 0 }, {deviceId(d1), 3 }, … ]
> > >>>
> > >>> [1] https://github.com/apache/incubator-iotdb/pull/736
> > >>>
> > >>> Thanks
> > >>>
> > >>> Dawei Liu
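
The layout the thread converges on (one start offset plus one total size per device, followed by a sequential scan of that device's TimeseriesMetadata) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the PR: the names `DeviceIndexSketch`, `DeviceEntry`, `writeSeries`, `findChunkOffsets` and `buildDemo` are hypothetical, the byte layout of one TimeseriesMetadata is a toy format made up for the example, and an in-memory `byte[]` stands in for the file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the IoTDB code base. */
public class DeviceIndexSketch {

  /** Per-device index entry: start of the first TimeseriesMetadata and total size of all of them. */
  static final class DeviceEntry {
    final int startOffset;
    final int totalSize;

    DeviceEntry(int startOffset, int totalSize) {
      this.startOffset = startOffset;
      this.totalSize = totalSize;
    }
  }

  /**
   * Assumed toy layout of one TimeseriesMetadata:
   * [nameLength:int][name:bytes][chunkCount:int][chunkOffset:long]*
   */
  static void writeSeries(ByteBuffer out, String name, long... chunkOffsets) {
    byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
    out.putInt(nameBytes.length);
    out.put(nameBytes);
    out.putInt(chunkOffsets.length);
    for (long offset : chunkOffsets) {
      out.putLong(offset);
    }
  }

  /** Reads the device's contiguous block once, then scans it for the requested measurement. */
  static List<Long> findChunkOffsets(
      byte[] file, Map<String, DeviceEntry> deviceMap, String device, String measurement) {
    DeviceEntry entry = deviceMap.get(device);
    if (entry == null) {
      return Collections.emptyList();
    }
    // One bounded view over [startOffset, startOffset + totalSize).
    ByteBuffer block = ByteBuffer.wrap(file, entry.startOffset, entry.totalSize);
    while (block.hasRemaining()) {
      byte[] nameBytes = new byte[block.getInt()];
      block.get(nameBytes);
      int chunkCount = block.getInt();
      List<Long> offsets = new ArrayList<>();
      for (int i = 0; i < chunkCount; i++) {
        offsets.add(block.getLong());
      }
      if (new String(nameBytes, StandardCharsets.UTF_8).equals(measurement)) {
        return offsets;
      }
    }
    return Collections.emptyList();
  }

  /** Writes two demo devices into the buffer and returns the device map describing them. */
  static Map<String, DeviceEntry> buildDemo(ByteBuffer out) {
    Map<String, DeviceEntry> deviceMap = new HashMap<>();
    int start = out.position();
    writeSeries(out, "s0", 100L);
    writeSeries(out, "s1", 200L, 300L);
    writeSeries(out, "s2", 400L);
    deviceMap.put("d0", new DeviceEntry(start, out.position() - start));
    start = out.position();
    writeSeries(out, "s0", 500L);
    deviceMap.put("d1", new DeviceEntry(start, out.position() - start));
    return deviceMap;
  }

  public static void main(String[] args) {
    ByteBuffer out = ByteBuffer.allocate(256);
    Map<String, DeviceEntry> deviceMap = buildDemo(out);
    byte[] file = out.array();
    System.out.println(findChunkOffsets(file, deviceMap, "d0", "s1")); // prints [200, 300]
    System.out.println(findChunkOffsets(file, deviceMap, "d1", "s0")); // prints [500]
  }
}
```

The sketch makes the trade-off discussed in the thread visible: the device map stays small (two numbers per device), but finding "d0.s1" still means deserializing every TimeseriesMetadata of "d0" that precedes it.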
