Hi,

I found another problem. When I execute: `SELECT s1 FROM xx WHERE time = 1`

with the new TsFile, we need to read the hard drive 3 times:

1. Read TsFileMetaData

2. Read the MetaData of all measurement of the device ( TimeSeriesMetaData )

3. Read the ChunkMetaData of the required measurement; only then can the time 
filter (time = 1) determine which Chunks can be used



In the current server, most of the time is spent on the TimeFilter. We read a 
lot of metadata, and if none of it turns out to be usable, that is a very big 
loss.

So I think we should add a time attribute, so that we can tell that a file 
cannot be used as soon as we first read the hard drive.
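
A minimal sketch of the idea, assuming the file-level metadata carried a 
start time and an end time. FileMeta, mayContain and candidates are 
hypothetical names used only for illustration; they are not taken from the 
IoTDB code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the IoTDB code base.
class FileTimeRangeSketch {

    // Stand-in for the part of TsFileMetaData that would carry the proposed time attribute.
    static final class FileMeta {
        final String path;
        final long startTime; // smallest timestamp stored in the file
        final long endTime;   // largest timestamp stored in the file

        FileMeta(String path, long startTime, long endTime) {
            this.path = path;
            this.startTime = startTime;
            this.endTime = endTime;
        }

        // True if the file may contain a point with the given timestamp.
        boolean mayContain(long time) {
            return startTime <= time && time <= endTime;
        }
    }

    // Keep only the files whose time range covers the queried timestamp.
    // The other files are rejected right after the first read.
    static List<String> candidates(List<FileMeta> files, long time) {
        List<String> result = new ArrayList<>();
        for (FileMeta f : files) {
            if (f.mayContain(time)) {
                result.add(f.path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<FileMeta> files = new ArrayList<>();
        files.add(new FileMeta("a.tsfile", 0L, 10L));
        files.add(new FileMeta("b.tsfile", 100L, 200L));
        // Query: time = 1
        System.out.println(candidates(files, 1L));
    }
}
```

With such a range, a file that cannot match `time = 1` would cost one read 
instead of three.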

Thanks

Dawei Liu


> On Feb 12, 2020, at 8:03 PM, Dawei Liu <[email protected]> wrote:
> 
> Hi,
> 
> I see it. It looks very comfortable.
> 
> Thanks
> 
> Dawei Liu
> 
>> On Feb 12, 2020, at 6:52 PM, Haonan Hou <[email protected]> wrote:
>> 
>> Sure, already added.
>> 
>> Thanks,
>> 
>> Haonan Hou
>> 
>>> On Feb 12, 2020, at 5:57 PM, Jialin Qiao <[email protected]> 
>>> wrote:
>>> 
>>> Hi Haonan,
>>> 
>>> 
>>> I cannot see the picture. Could you please put it in the PR?
>>> 
>>> Thanks,
>>> --
>>> Jialin Qiao
>>> School of Software, Tsinghua University
>>> 
>>> 乔嘉林
>>> 清华大学 软件学院
>>> 
>>> -----Original Message-----
>>> From: "Haonan Hou" <[email protected]>
>>> Sent: 2020-02-12 16:27:22 (Wednesday)
>>> To: "[email protected]" <[email protected]>
>>> Cc:
>>> Subject: Re: Suggestions for new TsFile
>>> 
>>> Hi, 
>>> 
>>> 
>>> We have a newer design of TsFile, which combines the suggestions from 
>>> Jialin and Dawei. 
>>> 
>>> 
>>> The main differences are as follows:
>>> 
>>> 
>>> 1. Remove TsOffsetArray.
>>> 2. Modify the device map in TsFileMetaData to store, for each device, the 
>>> start offset of its first TimeseriesMetadata and the total data size of 
>>> all its TimeseriesMetadata entries. 
>>> 
>>> 
>>> 
>>> 
>>> The newer TsFile structure should look like this:
>>> 
>>> Here is an example of how the new structure works.
>>> 
>>> 
>>> When we try to get the List<ChunkMetadata> of timeseries "d0.s1", we first 
>>> deserialize the map in TsFileMetadata. This gives us the startOffset of 
>>> TimeseriesMetadata "s0", which is the first TimeseriesMetadata of "d0", 
>>> and the data size of all TimeseriesMetadata entries in "d0". 
>>> 
>>> 
>>> After that, we are able to deserialize all TimeseriesMetadata in “d0”. 
>>> 
>>> 
>>> Finally we have the TimeseriesMetadata "d0.s1" and can get the 
>>> ChunkMetadata List of "d0.s1".
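
The lookup described above can be sketched roughly as follows. This is only 
an illustration: the toy byte layout (name length, name, one chunk metadata 
offset) and the names writeEntry and find are assumptions, not the real 
TsFile format or API. The device map stores {startOffset, size} per device, 
as in the proposal.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the byte layout and all names are assumptions, not the real TsFile format.
class DeviceIndexSketch {

    // Append one toy TimeseriesMetadata entry:
    // [int nameLength][name bytes][long chunkMetadataOffset]
    static void writeEntry(ByteBuffer out, String measurement, long chunkMetadataOffset) {
        byte[] name = measurement.getBytes(StandardCharsets.UTF_8);
        out.putInt(name.length).put(name).putLong(chunkMetadataOffset);
    }

    // Scan only the byte range [start, start + size) that belongs to one device and
    // return the chunk metadata offset of the requested measurement, or -1 if absent.
    static long find(ByteBuffer file, long[] startAndSize, String measurement) {
        ByteBuffer region = file.duplicate(); // independent position/limit, shared bytes
        region.position((int) startAndSize[0]);
        region.limit((int) (startAndSize[0] + startAndSize[1]));
        while (region.hasRemaining()) {
            byte[] name = new byte[region.getInt()];
            region.get(name);
            long offset = region.getLong();
            if (measurement.equals(new String(name, StandardCharsets.UTF_8))) {
                return offset;
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        ByteBuffer file = ByteBuffer.allocate(256);
        Map<String, long[]> deviceMap = new HashMap<>(); // deviceId -> {startOffset, size}

        int start = file.position();
        writeEntry(file, "s0", 1000L);
        writeEntry(file, "s1", 2000L);
        deviceMap.put("d0", new long[] {start, file.position() - start});

        start = file.position();
        writeEntry(file, "s0", 3000L);
        deviceMap.put("d1", new long[] {start, file.position() - start});

        // Look up "d0.s1": one map access, then one contiguous scan of d0's region.
        System.out.println(find(file, deviceMap.get("d0"), "s1"));
    }
}
```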
>>> 
>>> 
>>> Thanks,
>>> 
>>> 
>>> Haonan Hou
>>> 
>>> 
>>> 
>>> 
>>> On Feb 11, 2020, at 8:08 PM, Jialin Qiao <[email protected]> wrote:
>>> 
>>> 
>>> Hi,
>>> 
>>> If each device only stores each offset of TimeseriesMetadata like this:
>>> TsFileMetaData ---> [ {deviceId(d0), [0,1,2] }, {deviceId(d1), [3,4,5] }, …
>>> ]
>>> 
>>> It could be simplified to recording the start offset and end offset:
>>> TsFileMetaData ---> [ {deviceId(d0), [0,2] }, {deviceId(d1), [3,5] }, … ]
>>> 
>>> And finally, it could be replaced by: TsFileMetaData ---> [ {deviceId(d0),
>>> 0 }, {deviceId(d1), 3 }, … ]
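
A minimal sketch of why the start offset alone can be enough, assuming the 
devices are kept in on-disk order and every device has at least one offset. 
collapse and endExclusive are hypothetical helper names, and endOfMetadata 
stands for wherever the TimeseriesMetadata region ends.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and the "devices in on-disk order" assumption are illustrative only.
class StartOffsetSketch {

    // Collapse each device's contiguous offset list to its first offset.
    static Map<String, Long> collapse(Map<String, List<Long>> perDeviceOffsets) {
        Map<String, Long> starts = new LinkedHashMap<>(); // keeps insertion (on-disk) order
        for (Map.Entry<String, List<Long>> e : perDeviceOffsets.entrySet()) {
            starts.put(e.getKey(), e.getValue().get(0));
        }
        return starts;
    }

    // Recover the exclusive end of a device's range: the start of the next device,
    // or endOfMetadata when the device is the last one.
    static long endExclusive(Map<String, Long> starts, String device, long endOfMetadata) {
        boolean found = false;
        for (Map.Entry<String, Long> e : starts.entrySet()) {
            if (found) {
                return e.getValue();
            }
            if (e.getKey().equals(device)) {
                found = true;
            }
        }
        if (!found) {
            throw new IllegalArgumentException("unknown device: " + device);
        }
        return endOfMetadata;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> full = new LinkedHashMap<>();
        full.put("d0", Arrays.asList(0L, 1L, 2L));
        full.put("d1", Arrays.asList(3L, 4L, 5L));

        Map<String, Long> starts = collapse(full);
        System.out.println(starts);
        System.out.println(endExclusive(starts, "d0", 6L));
    }
}
```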
>>> 
>>> Thanks,
>>> —————————————————
>>> Jialin Qiao
>>> School of Software, Tsinghua University
>>> 
>>> 乔嘉林
>>> 清华大学 软件学院
>>> 
>>> 
>>> On Tue, Feb 11, 2020 at 7:59 PM, atoiLiu <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> Thank you for your reply.
>>> I am very happy that you can take my suggestion.
>>> 
>>> 
>>> Thanks
>>> 
>>> Dawei Liu
>>> 
>>> 
>>> On Feb 11, 2020, at 6:04 PM, Haonan Hou <[email protected]> wrote:
>>> 
>>> Hi Dawei,
>>> 
>>> Thank you so much for sharing your opinion on the new TsFile!
>>> I am very happy to take your suggestions.
>>> 
>>> You said we can remove TsOffsetArray and directly store the offset of
>>> TimeseriesMetaData. I agree with you. It is better than my version.
>>> Besides, for the optimization of TimeseriesMetaData, I would like to
>>> discuss with other people to determine which way is better.
>>> 
>>> Best,
>>> 
>>> Haonan Hou
>>> 
>>> 
>>> On Feb 11, 2020, at 5:35 PM, atoiLiu <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> I’m studying the new TsFile in PR [1], but I think TsFileMetaData has a
>>> design problem.
>>> 
>>> TsFileMetaData has a TsOffsetArray, which records the offset of every
>>> TimeseriesMetaData, and uses a Map<deviceId, int[]> to record each device's
>>> startIndex and endIndex in TsOffsetArray. It looks like:
>>> 
>>> TsFileMetaData ---> { [0,1,2,3,4,5, …], [ {deviceId(d0), [0,2] },
>>> {deviceId(d1), [3,5] }, … ] }
>>> 
>>> We can delete TsOffsetArray and store the offsets directly in the
>>> deviceIndexArray, so that TsFileMetaData has a Map<deviceId, List<Long>>
>>> to record them. This change will save 4 bytes per device on disk, because
>>> every device just needs to record the number of offsets and the offsets
>>> themselves. It looks like:
>>> 
>>> TsFileMetaData ---> [ {deviceId(d0), [0,1,2] }, {deviceId(d1), [3,4,5]
>>> }, … ]
>>> 
>>> 
>>> In addition, TimeSeriesMetaData is an ordered structure on the hard
>>> disk, and the TimeSeriesMetaData entries of each device are stored next to
>>> each other, so TsFileMetaData does not need to store all of the offset
>>> information. There are two optimization directions:
>>> 
>>> 1. Save the startTime, endTime and offset of each TimeSeriesMetaData in
>>> TsFileMetaData. The nice thing about this is that once TsFileMetaData has
>>> been read from the hard drive, you can apply the time filter directly and
>>> skip every TimeSeriesMetaData that does not need to be read.
>>> 
>>> 
>>> 2. Only save the offset of each device's first TimeSeriesMetaData in
>>> TsFileMetaData, so that you can loop through the entries after a single
>>> seek. It looks like:
>>> 
>>> TsFileMetaData ---> [ {deviceId(d0), 0 }, {deviceId(d1), 3 }, … ]
>>> 
>>> 
>>> 
>>> [1] https://github.com/apache/incubator-iotdb/pull/736
>>> 
>>> Thanks
>>> 
>>> Dawei Liu
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
