Re: Suggestions for new TsFile

Jialin Qiao Thu, 13 Feb 2020 00:59:11 -0800

Hi,

I have a suggestion: we could add a Statistics field to TimeseriesMetadata
to support fast aggregations.
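
To make the idea concrete, here is a minimal Java sketch of how series-level statistics could answer aggregations such as count, max and avg without touching chunk data. The class `StatisticsSketch` and all of its fields are hypothetical names for illustration, not the actual Statistics class in the PR.

```java
/** Hypothetical, simplified statistics record; not the real IoTDB class. */
final class StatisticsSketch {
    long count;
    long startTime = Long.MAX_VALUE;
    long endTime = Long.MIN_VALUE;
    double minValue = Double.POSITIVE_INFINITY;
    double maxValue = Double.NEGATIVE_INFINITY;
    double sum;

    /** Folds one data point in; this would happen while a chunk is written. */
    void update(long time, double value) {
        count++;
        startTime = Math.min(startTime, time);
        endTime = Math.max(endTime, time);
        minValue = Math.min(minValue, value);
        maxValue = Math.max(maxValue, value);
        sum += value;
    }

    /** Merges chunk-level statistics into series-level statistics. */
    void merge(StatisticsSketch other) {
        count += other.count;
        startTime = Math.min(startTime, other.startTime);
        endTime = Math.max(endTime, other.endTime);
        minValue = Math.min(minValue, other.minValue);
        maxValue = Math.max(maxValue, other.maxValue);
        sum += other.sum;
    }

    /** Builds series-level statistics from two example chunks. */
    static StatisticsSketch example() {
        StatisticsSketch chunk1 = new StatisticsSketch();
        chunk1.update(1L, 10.0);
        chunk1.update(2L, 30.0);
        StatisticsSketch chunk2 = new StatisticsSketch();
        chunk2.update(3L, 20.0);
        StatisticsSketch series = new StatisticsSketch();
        series.merge(chunk1);
        series.merge(chunk2);
        return series;
    }

    public static void main(String[] args) {
        StatisticsSketch s = example();
        // count, max and avg come from the metadata alone; no chunk is read.
        System.out.println(s.count + " " + s.maxValue + " " + (s.sum / s.count));
    }
}
```

Such a shortcut would only apply when the query has no filter, or when its time filter covers the whole [startTime, endTime] of the series; otherwise the chunks still have to be read.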


Thanks,
--
Jialin Qiao
School of Software, Tsinghua University

乔嘉林
清华大学 软件学院

> -----Original Message-----
> From: "Jialin Qiao" <[email protected]>
> Sent: 2020-02-13 16:16:54 (Thursday)
> To: [email protected]
> Cc: 
> Subject: Re: Suggestions for new TsFile
> 
> Hi,
> 
> +1, most queries contain a time filter.
> 
> But I don't understand what you mean by "add a time attribute". Where would it be added?
> 
> Thanks,
> --
> Jialin Qiao
> School of Software, Tsinghua University
> 
> 乔嘉林
> 清华大学 软件学院
> 
> > -----Original Message-----
> > From: "Dawei Liu" <[email protected]>
> > Sent: 2020-02-13 15:55:48 (Thursday)
> > To: [email protected]
> > Cc: 
> > Subject: Re: Suggestions for new TsFile
> > 
> > Hi,
> > 
> > I found another problem. When I execute `SELECT s1 FROM xx WHERE time = 1`
> > on the new TsFile, the disk has to be read three times:
> > 
> > 1. Read TsFileMetaData.
> > 
> > 2. Read the metadata of all measurements of the device (TimeSeriesMetaData).
> > 
> > 3. Read the ChunkMetaData of the required measurement; only then can the
> > time filter (time = 1) determine which Chunks are needed.
> > 
> > 
> > 
> > In the current server, most of the time is spent on the time filter. We read
> > a lot of metadata, and if in the end it cannot be used, that is a very big
> > loss.
> > 
> > So I think we should add a time attribute, so that we can tell from the
> > first disk read whether the file can be skipped.
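
One possible reading of the suggestion above, sketched in Java under assumptions: if TsFileMetaData kept a start and end time per device, a query like time = 1 could stop after the first of the three reads whenever the file cannot contain matching data. `TimeRangePruningSketch`, `TimeRange` and `metadataReads` are hypothetical names, and where the attribute would really live is still open in this thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names exist in the TsFile code. */
final class TimeRangePruningSketch {
    /** Assumed time attribute: min and max timestamp of one device in the file. */
    static final class TimeRange {
        final long startTime;
        final long endTime;

        TimeRange(long startTime, long endTime) {
            this.startTime = startTime;
            this.endTime = endTime;
        }

        boolean contains(long time) {
            return startTime <= time && time <= endTime;
        }
    }

    /** Stand-in for the part of TsFileMetaData that would hold the ranges. */
    private final Map<String, TimeRange> deviceRanges = new HashMap<>();

    void putRange(String deviceId, long startTime, long endTime) {
        deviceRanges.put(deviceId, new TimeRange(startTime, endTime));
    }

    /** Counts the metadata reads a "time = t" query needs for this file. */
    int metadataReads(String deviceId, long time) {
        int reads = 1; // read 1: TsFileMetaData, which now carries the ranges
        TimeRange range = deviceRanges.get(deviceId);
        if (range == null || !range.contains(time)) {
            return reads; // the file cannot match, so reads 2 and 3 are skipped
        }
        reads++; // read 2: TimeSeriesMetaData of the device
        reads++; // read 3: ChunkMetaData of the measurement
        return reads;
    }

    public static void main(String[] args) {
        TimeRangePruningSketch file = new TimeRangePruningSketch();
        file.putRange("d0", 100L, 200L);
        System.out.println(file.metadataReads("d0", 1L));   // pruned after read 1
        System.out.println(file.metadataReads("d0", 150L)); // needs all three reads
    }
}
```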
> > 
> > Thanks
> > 
> > Dawei Liu
> > 
> > 
> > > On Feb 12, 2020, at 8:03 PM, Dawei Liu <[email protected]> wrote:
> > > 
> > > Hi,
> > > 
> > > I see it. It looks very good.
> > > 
> > > Thanks
> > > 
> > > Dawei Liu
> > > 
> > >> On Feb 12, 2020, at 6:52 PM, Haonan Hou <[email protected]> wrote:
> > >> 
> > >> Sure, already added.
> > >> 
> > >> Thanks,
> > >> 
> > >> Haonan Hou
> > >> 
> > >>> On Feb 12, 2020, at 5:57 PM, Jialin Qiao <[email protected]> 
> > >>> wrote:
> > >>> 
> > >>> Hi Haonan,
> > >>> 
> > >>> 
> > >>> I cannot see the picture. Could you please put it in the PR?
> > >>> 
> > >>> Thanks,
> > >>> --
> > >>> Jialin Qiao
> > >>> School of Software, Tsinghua University
> > >>> 
> > >>> 乔嘉林
> > >>> 清华大学 软件学院
> > >>> 
> > >>> -----Original Message-----
> > >>> From: "Haonan Hou" <[email protected]>
> > >>> Sent: 2020-02-12 16:27:22 (Wednesday)
> > >>> To: "[email protected]" <[email protected]>
> > >>> Cc:
> > >>> Subject: Re: Suggestions for new TsFile
> > >>> 
> > >>> Hi, 
> > >>> 
> > >>> 
> > >>> We have a newer design of TsFile, which combines the suggestions from 
> > >>> Jialin and Dawei. 
> > >>> 
> > >>> 
> > >>> The main differences are as follows:
> > >>> 
> > >>> 
> > >>> 1. Remove TsOffsetArray.
> > >>> 2. Modify the device map in TsFileMetaData to store the start offset of 
> > >>> first TimeseriesMetadata and total data size of all TimeseriesMetadatas 
> > >>> in each device. 
> > >>> 
> > >>> 
> > >>> 
> > >>> 
> > >>> The newer TsFile structure should look like:
> > >>> 
> > >>> Here is an example of how the new structure works.
> > >>> 
> > >>> 
> > >>> When we try to get the List<ChunkMetadata> of timeseries "d0.s1", we first 
> > >>> deserialize the map in TsFileMetadata, which gives us the startOffset of 
> > >>> TimeseriesMetadata "s0", the first TimeseriesMetadata of "d0", and the 
> > >>> total data size of all TimeseriesMetadatas in "d0". 
> > >>> 
> > >>> 
> > >>> After that, we are able to deserialize all TimeseriesMetadata in “d0”. 
> > >>> 
> > >>> 
> > >>> Finally we have the TimeseriesMetadata "d0.s1" and can get the 
> > >>> ChunkMetadata List of "d0.s1".
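
The lookup described above can be sketched in Java as follows. This is only an illustration under assumptions: the on-disk layout of a TimeseriesMetadata is reduced to a toy format (name length, name, offset of its ChunkMetadata list), and `DeviceIndexSketch`, `DeviceEntry`, `writeSeries` and `readDevice` are hypothetical names, not code from the PR.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the device map lookup; the format is a toy one. */
final class DeviceIndexSketch {
    /** Device map value: start of the device's block and its total size. */
    static final class DeviceEntry {
        final long startOffset;
        final int dataSize;

        DeviceEntry(long startOffset, int dataSize) {
            this.startOffset = startOffset;
            this.dataSize = dataSize;
        }
    }

    /** Appends one toy TimeseriesMetadata: name length, name, chunk list offset. */
    static void writeSeries(ByteBuffer out, String measurement, long chunkListOffset) {
        byte[] name = measurement.getBytes(StandardCharsets.UTF_8);
        out.putInt(name.length).put(name).putLong(chunkListOffset);
    }

    /** Reads the whole block of one device and maps measurement to chunk list offset. */
    static Map<String, Long> readDevice(ByteBuffer file, DeviceEntry entry) {
        ByteBuffer block = file.duplicate();
        block.position((int) entry.startOffset);
        block.limit((int) entry.startOffset + entry.dataSize);
        Map<String, Long> series = new LinkedHashMap<>();
        while (block.hasRemaining()) {
            byte[] name = new byte[block.getInt()];
            block.get(name);
            series.put(new String(name, StandardCharsets.UTF_8), block.getLong());
        }
        return series;
    }

    /** Builds a toy file with d0 (s0, s1) and d1 (s0), then looks up d0.s1. */
    static long lookupExample() {
        ByteBuffer file = ByteBuffer.allocate(256);
        Map<String, DeviceEntry> deviceMap = new LinkedHashMap<>();
        int start = file.position();
        writeSeries(file, "s0", 1000L);
        writeSeries(file, "s1", 2000L);
        deviceMap.put("d0", new DeviceEntry(start, file.position() - start));
        start = file.position();
        writeSeries(file, "s0", 3000L);
        deviceMap.put("d1", new DeviceEntry(start, file.position() - start));
        // One contiguous read covers every TimeseriesMetadata of d0.
        return readDevice(file, deviceMap.get("d0")).get("s1");
    }

    public static void main(String[] args) {
        System.out.println(lookupExample());
    }
}
```

Because the start offset and the total size are both known, the block of a device can be fetched with one seek and one read before any entry is deserialized.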
> > >>> 
> > >>> 
> > >>> Thanks,
> > >>> 
> > >>> 
> > >>> Haonan Hou
> > >>> 
> > >>> 
> > >>> 
> > >>> 
> > >>> On Feb 11, 2020, at 8:08 PM, Jialin Qiao <[email protected]> wrote:
> > >>> 
> > >>> 
> > >>> Hi,
> > >>> 
> > >>> If each device only stores the offset of every TimeseriesMetadata, like this:
> > >>> TsFileMetaData ---> [ {deviceId(d0), [0,1,2] }, {deviceId(d1), [3,4,5] }, … ]
> > >>> 
> > >>> it could be simplified to record only the start offset and end offset:
> > >>> TsFileMetaData ---> [ {deviceId(d0), [0,2] }, {deviceId(d1), [3,5] }, … ]
> > >>> 
> > >>> And finally, it could be replaced by:
> > >>> TsFileMetaData ---> [ {deviceId(d0), 0 }, {deviceId(d1), 3 }, … ]
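
A short Java sketch of why the start offset alone can be enough, assuming the devices are kept in on-disk order and their TimeseriesMetadata blocks are contiguous: the end of one block is then the start of the next. `OffsetCompactionSketch`, `startOnly` and `endExclusive` are hypothetical names used only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the simplification; names are illustrative only. */
final class OffsetCompactionSketch {
    /** Keeps only the first offset of each device (the last form above). */
    static Map<String, Long> startOnly(Map<String, long[]> allOffsets) {
        Map<String, Long> starts = new LinkedHashMap<>();
        for (Map.Entry<String, long[]> e : allOffsets.entrySet()) {
            starts.put(e.getKey(), e.getValue()[0]);
        }
        return starts;
    }

    /**
     * Recovers where a device's block ends (exclusive): the start of the next
     * device, or the end of the whole metadata section for the last device.
     */
    static long endExclusive(Map<String, Long> starts, String deviceId, long sectionEnd) {
        if (!starts.containsKey(deviceId)) {
            throw new IllegalArgumentException("unknown device: " + deviceId);
        }
        boolean passedDevice = false;
        for (Map.Entry<String, Long> e : starts.entrySet()) {
            if (passedDevice) {
                return e.getValue();
            }
            passedDevice = e.getKey().equals(deviceId);
        }
        return sectionEnd;
    }

    /** The d0/d1 example from this mail, reduced to start offsets. */
    static Map<String, Long> example() {
        Map<String, long[]> all = new LinkedHashMap<>();
        all.put("d0", new long[] {0, 1, 2});
        all.put("d1", new long[] {3, 4, 5});
        return startOnly(all);
    }

    public static void main(String[] args) {
        Map<String, Long> starts = example();
        System.out.println(starts); // {d0=0, d1=3}
        System.out.println(endExclusive(starts, "d0", 6L));
    }
}
```

The last device has no successor, so the sketch needs the end of the metadata section as an extra input.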
> > >>> 
> > >>> Thanks,
> > >>> —————————————————
> > >>> Jialin Qiao
> > >>> School of Software, Tsinghua University
> > >>> 
> > >>> 乔嘉林
> > >>> 清华大学 软件学院
> > >>> 
> > >>> 
> > >>> On Tue, Feb 11, 2020 at 7:59 PM, atoiLiu <[email protected]> wrote:
> > >>> 
> > >>> Hi,
> > >>> 
> > >>> Thank you for your reply.
> > >>> I am very happy that you can take my suggestion.
> > >>> 
> > >>> 
> > >>> Thanks
> > >>> 
> > >>> Dawei Liu
> > >>> 
> > >>> 
> > >>> On Feb 11, 2020, at 6:04 PM, Haonan Hou <[email protected]> wrote:
> > >>> 
> > >>> Hi Dawei,
> > >>> 
> > >>> Thank you so much for sharing your opinion about the new TsFile!
> > >>> I am very happy to take your suggestions.
> > >>> 
> > >>> You said we can remove TsOffsetArray and directly store the offset of
> > >>> TimeseriesMetaData. I agree with you. It is better than my version.
> > >>> Besides, for the optimization of TimeseriesMetaData, I would like to
> > >>> discuss with other people to determine which way is better.
> > >>> 
> > >>> Best,
> > >>> 
> > >>> Haonan Hou
> > >>> 
> > >>> 
> > >>> On Feb 11, 2020, at 5:35 PM, atoiLiu <[email protected]> wrote:
> > >>> 
> > >>> Hi,
> > >>> 
> > >>> I'm learning the new TsFile in PR [1], and I think TsFileMetaData has a
> > >>> design problem.
> > >>> 
> > >>> TsFileMetaData has a TsOffsetArray, which records the offset of every
> > >>> TimeseriesMetaData, and uses a Map<deviceId, int[]> to record the
> > >>> startIndex and endIndex into the TsOffsetArray. It looks like:
> > >>> 
> > >>> TsFileMetaData ---> { [0,1,2,3,4,5, …], [ {deviceId(d0), [0,2] },
> > >>> {deviceId(d1), [3,5] }, … ] }
> > >>> 
> > >>> We can delete the TsOffsetArray and store the offsets directly in the
> > >>> deviceIndexArray, so that TsFileMetaData has a Map<deviceId, List<Long>>
> > >>> to record them. This change will save 4 bytes per device on disk, because
> > >>> each device only needs to record the number of offsets and the offsets
> > >>> themselves. It looks like:
> > >>> 
> > >>> TsFileMetaData ---> [ {deviceId(d0), [0,1,2] }, {deviceId(d1), [3,4,5] }, … ]
> > >>> 
> > >>> 
> > >>> In addition, TimeSeriesMetaData is stored in order on disk, and the
> > >>> TimeSeriesMetaData of each device is stored together, so TsFileMetaData
> > >>> does not need to store every offset. There are two optimization
> > >>> directions:
> > >>> 
> > >>> 1. Save the startTime, endTime and offset of each TimeSeriesMetaData in
> > >>> TsFileMetaData. The advantage is that as soon as TsFileMetaData has been
> > >>> read from disk, a filter can be applied directly to decide which
> > >>> TimeSeriesMetaData need not be read.
> > >>> 
> > >>> 
> > >>> 2. Save only the offset of the first TimeSeriesMetaData of each device in
> > >>> TsFileMetaData, so that a single seek is enough and the rest can be read
> > >>> sequentially. It looks like:
> > >>> 
> > >>> TsFileMetaData ---> [ {deviceId(d0), 0 }, {deviceId(d1), 3 }, … ]
> > >>> 
> > >>> 
> > >>> 
> > >>> [1] https://github.com/apache/incubator-iotdb/pull/736
> > >>> 
> > >>> Thanks
> > >>> 
> > >>> Dawei Liu
> > >>> 
> > >>> 
> > >>> 
> > >>> 
> > >>> 
> > >>
