Re: Suggestions for new TsFile

Dawei Liu Thu, 13 Feb 2020 01:22:12 -0800

Hi,

Sorry, I overlooked that the first step in the server is to filter files through 
the startTime/endTime Map.


+1 for adding a Statistics to TimeseriesMetadata.

For example:
a Device Shadow often needs to look up the latest information about 
a device.
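
To make the Device Shadow case concrete, here is a minimal sketch, assuming the Statistics kept in TimeseriesMetadata records the last timestamp and value of each timeseries. All class and field names are hypothetical and only illustrate the idea; they are not the actual TsFile API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: names are illustrative, not the real TsFile classes.
class LastValueSketch {

  // Per-timeseries statistics stored next to the metadata (assumed fields).
  static final class Statistics {
    long startTime = Long.MAX_VALUE;
    long endTime = Long.MIN_VALUE;
    long count = 0;
    double lastValue = Double.NaN;

    // Update the statistics as points are written.
    void update(long time, double value) {
      startTime = Math.min(startTime, time);
      count++;
      if (time >= endTime) { // only a newer point replaces the last value
        endTime = time;
        lastValue = value;
      }
    }
  }

  // Stand-in for TimeseriesMetadata: only the part relevant here.
  static final class TimeseriesMetadata {
    final String measurementId;
    final Statistics statistics = new Statistics();

    TimeseriesMetadata(String measurementId) {
      this.measurementId = measurementId;
    }
  }

  // Build the "shadow" of a device: the last value of every measurement,
  // answered from metadata alone, without reading any chunk data.
  static Map<String, Double> deviceShadow(Iterable<TimeseriesMetadata> metadataOfDevice) {
    Map<String, Double> shadow = new LinkedHashMap<>();
    for (TimeseriesMetadata m : metadataOfDevice) {
      if (m.statistics.count > 0) {
        shadow.put(m.measurementId, m.statistics.lastValue);
      }
    }
    return shadow;
  }

  public static void main(String[] args) {
    TimeseriesMetadata s0 = new TimeseriesMetadata("s0");
    s0.statistics.update(1, 20.5);
    s0.statistics.update(2, 21.0);
    System.out.println(deviceShadow(Arrays.asList(s0)));
  }
}
```

The same statistics could also answer count or time-range aggregations without touching chunk data, which is the fast-aggregation case mentioned in the quoted message.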

Thanks

Dawei Liu

> On Feb 13, 2020, at 4:58 PM, Jialin Qiao <[email protected]> wrote:
> 
> Hi,
> 
> I have a suggestion. 
> We could add a Statistics in TimeseriesMetadata to support fast aggregations.
> 
> Thanks,
> --
> Jialin Qiao
> School of Software, Tsinghua University
> 
> 乔嘉林
> 清华大学 软件学院
> 
>> -----Original Message-----
>> From: "Jialin Qiao" <[email protected]>
>> Sent: 2020-02-13 16:16:54 (Thursday)
>> To: [email protected]
>> Cc: 
>> Subject: Re: Suggestions for new TsFile
>> 
>> Hi,
>> 
>> +1, most queries contain a time filter.
>> 
>> But I don't know what you mean by "add a time attribute". Add it where?
>> 
>> Thanks,
>> --
>> Jialin Qiao
>> School of Software, Tsinghua University
>> 
>> 乔嘉林
>> 清华大学 软件学院
>> 
>>> -----Original Message-----
>>> From: "Dawei Liu" <[email protected]>
>>> Sent: 2020-02-13 15:55:48 (Thursday)
>>> To: [email protected]
>>> Cc: 
>>> Subject: Re: Suggestions for new TsFile
>>> 
>>> Hi,
>>> 
>>> I found another problem. When I execute `SELECT s1 FROM xx WHERE time = 1`
>>> in the new TsFile, the hard drive has to be read 3 times:
>>> 
>>> 1. Read TsFileMetaData.
>>> 
>>> 2. Read the metadata of all measurements of the device (TimeSeriesMetaData).
>>> 
>>> 3. Read the ChunkMetaData of the required measurement. Only then can the 
>>> time filter (time = 1) determine which Chunks can be used.
>>> 
>>> 
>>> 
>>> In the current server, most of the time is spent on the TimeFilter. We read a 
>>> lot of metadata, and if none of it turns out to be usable, that is a very big 
>>> loss.
>>> 
>>> So I think we should add a time attribute, so that we know whether the file 
>>> can be used at all the first time we read the hard drive.
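
The proposal above can be sketched as follows. This is only an illustration under stated assumptions: the metadata read in step 1 is assumed to carry a per-device time range, and `TimeRange`, `FileMetadata` and `mayContain` are hypothetical names, not existing TsFile classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: illustrative names, not the real TsFile classes.
class TimeRangeSketch {

  // Assumed extra attribute: the time range one device covers in this file.
  static final class TimeRange {
    final long startTime;
    final long endTime;

    TimeRange(long startTime, long endTime) {
      this.startTime = startTime;
      this.endTime = endTime;
    }

    boolean contains(long time) {
      return startTime <= time && time <= endTime;
    }
  }

  // Stand-in for the part of TsFileMetaData fetched by the first disk read.
  static final class FileMetadata {
    final Map<String, TimeRange> deviceTimeRange = new HashMap<>();
  }

  // Returns false when the file cannot contain the requested point, so the
  // reads of TimeSeriesMetaData and ChunkMetaData (steps 2 and 3) are skipped.
  static boolean mayContain(FileMetadata file, String device, long time) {
    TimeRange range = file.deviceTimeRange.get(device);
    return range != null && range.contains(time);
  }

  public static void main(String[] args) {
    FileMetadata file = new FileMetadata();
    file.deviceTimeRange.put("d0", new TimeRange(100, 200));
    // time = 1 lies outside [100, 200], so this file can be skipped early.
    System.out.println(mayContain(file, "d0", 1));
    System.out.println(mayContain(file, "d0", 150));
  }
}
```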
>>> 
>>> Thanks
>>> 
>>> Dawei Liu
>>> 
>>> 
>>>> On Feb 12, 2020, at 8:03 PM, Dawei Liu <[email protected]> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I see it. It looks very good.
>>>> 
>>>> Thanks
>>>> 
>>>> Dawei Liu
>>>> 
>>>>> On Feb 12, 2020, at 6:52 PM, Haonan Hou <[email protected]> wrote:
>>>>> 
>>>>> Sure, already added.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Haonan Hou
>>>>> 
>>>>>> On Feb 12, 2020, at 5:57 PM, Jialin Qiao <[email protected]> 
>>>>>> wrote:
>>>>>> 
>>>>>> Hi Haonan,
>>>>>> 
>>>>>> 
>>>>>> I cannot see the picture. Could you please put it in the PR?
>>>>>> 
>>>>>> Thanks,
>>>>>> --
>>>>>> Jialin Qiao
>>>>>> School of Software, Tsinghua University
>>>>>> 
>>>>>> 乔嘉林
>>>>>> 清华大学 软件学院
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: "Haonan Hou" <[email protected]>
>>>>>> Sent: 2020-02-12 16:27:22 (Wednesday)
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Cc:
>>>>>> Subject: Re: Suggestions for new TsFile
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> 
>>>>>> We have a newer design of TsFile, which combines the suggestions from 
>>>>>> Jialin and Dawei. 
>>>>>> 
>>>>>> 
>>>>>> The main differences are as follows:
>>>>>> 
>>>>>> 
>>>>>> 1. Remove TsOffsetArray.
>>>>>> 2. Modify the device map in TsFileMetaData to store, for each device, the 
>>>>>> start offset of its first TimeseriesMetadata and the total data size of 
>>>>>> all its TimeseriesMetadatas. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> The newer TsFile structure should look like this:
>>>>>> 
>>>>>> Here is an example of how the new structure works.
>>>>>> 
>>>>>> 
>>>>>> When we try to get the List<ChunkMetadata> of timeseries "d0.s1", we first 
>>>>>> deserialize the map in TsFileMetadata, which gives us the startOffset of 
>>>>>> TimeseriesMetadata "s0", the first TimeseriesMetadata of "d0", and the 
>>>>>> data size of all TimeseriesMetadatas in "d0". 
>>>>>> 
>>>>>> 
>>>>>> After that, we are able to deserialize all TimeseriesMetadata in “d0”. 
>>>>>> 
>>>>>> 
>>>>>> Finally we have the TimeseriesMetadata "d0.s1" and can get the 
>>>>>> ChunkMetadata List of "d0.s1".
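
The lookup described above can be sketched in a few lines. This is a minimal illustration, not the actual TsFile reader: the byte layout of a TimeseriesMetadata entry, the `DeviceEntry` class and the method names are all assumptions made for the example, and an in-memory ByteBuffer stands in for the file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: illustrative names and layout, not the real TsFile format.
class DeviceIndexSketch {

  // Assumed entry of the device map in TsFileMetadata.
  static final class DeviceEntry {
    final int startOffset; // offset of the device's first TimeseriesMetadata
    final int totalSize;   // total size of all TimeseriesMetadata of the device

    DeviceEntry(int startOffset, int totalSize) {
      this.startOffset = startOffset;
      this.totalSize = totalSize;
    }
  }

  // Assumed layout of one TimeseriesMetadata:
  // [name length: int][name: UTF-8][n: int][n ChunkMetadata offsets: long]
  static void writeTimeseries(ByteBuffer out, String name, long... chunkOffsets) {
    byte[] bytes = name.getBytes(StandardCharsets.UTF_8);
    out.putInt(bytes.length).put(bytes).putInt(chunkOffsets.length);
    for (long offset : chunkOffsets) {
      out.putLong(offset);
    }
  }

  // Find the ChunkMetadata offsets of device.measurement with one contiguous read.
  static List<Long> chunkMetadataOffsets(
      ByteBuffer file, Map<String, DeviceEntry> deviceMap, String device, String measurement) {
    DeviceEntry entry = deviceMap.get(device);
    if (entry == null) {
      return Collections.emptyList();
    }
    // Restrict a view of the file to the device's TimeseriesMetadata region.
    ByteBuffer region = file.duplicate();
    region.limit(entry.startOffset + entry.totalSize);
    region.position(entry.startOffset);
    // Deserialize every TimeseriesMetadata of the device until the match.
    while (region.hasRemaining()) {
      byte[] name = new byte[region.getInt()];
      region.get(name);
      int n = region.getInt();
      List<Long> offsets = new ArrayList<>(n);
      for (int i = 0; i < n; i++) {
        offsets.add(region.getLong());
      }
      if (measurement.equals(new String(name, StandardCharsets.UTF_8))) {
        return offsets;
      }
    }
    return Collections.emptyList();
  }

  public static void main(String[] args) {
    ByteBuffer file = ByteBuffer.allocate(1024);
    Map<String, DeviceEntry> deviceMap = new HashMap<>();
    int start = file.position();
    writeTimeseries(file, "s0", 100L);
    writeTimeseries(file, "s1", 200L, 300L);
    deviceMap.put("d0", new DeviceEntry(start, file.position() - start));
    file.flip();
    System.out.println(chunkMetadataOffsets(file, deviceMap, "d0", "s1"));
  }
}
```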
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> 
>>>>>> Haonan Hou
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Feb 11, 2020, at 8:08 PM, Jialin Qiao <[email protected]> wrote:
>>>>>> 
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> If each device only stores each offset of TimeseriesMetadata like this:
>>>>>> TsFileMetaData ---> [ {deviceId(d0), [0,1,2] }, {deviceId(d1), [3,4,5] }, … ]
>>>>>> 
>>>>>> It could be simplified to recording the start offset and end offset:
>>>>>> TsFileMetaData ---> [ {deviceId(d0), [0,2] }, {deviceId(d1), [3,5] }, … ]
>>>>>> 
>>>>>> And finally, it could be replaced by:
>>>>>> TsFileMetaData ---> [ {deviceId(d0), 0 }, {deviceId(d1), 3 }, … ]
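
One way to read the last simplification: if only the start is stored, the end of a device can be recovered from the start of the next device. The sketch below shows that derivation. It assumes the devices are kept in on-disk order and that the total number of TimeseriesMetadata entries is known, neither of which is stated in this thread; the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: illustrative names, not the real TsFile classes.
class StartOffsetSketch {

  // Recover [start, end] (inclusive) for one device from start offsets alone.
  // Assumes the map preserves on-disk device order and that the total number
  // of entries is known.
  static int[] range(LinkedHashMap<String, Integer> startByDevice, int totalEntries, String device) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(startByDevice.entrySet());
    for (int i = 0; i < entries.size(); i++) {
      if (entries.get(i).getKey().equals(device)) {
        int start = entries.get(i).getValue();
        // The next device's start marks where this device ends.
        int next = (i + 1 < entries.size()) ? entries.get(i + 1).getValue() : totalEntries;
        return new int[] {start, next - 1};
      }
    }
    return null; // unknown device
  }

  public static void main(String[] args) {
    LinkedHashMap<String, Integer> startByDevice = new LinkedHashMap<>();
    startByDevice.put("d0", 0);
    startByDevice.put("d1", 3);
    System.out.println(Arrays.toString(range(startByDevice, 6, "d0")));
    System.out.println(Arrays.toString(range(startByDevice, 6, "d1")));
  }
}
```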
>>>>>> 
>>>>>> Thanks,
>>>>>> —————————————————
>>>>>> Jialin Qiao
>>>>>> School of Software, Tsinghua University
>>>>>> 
>>>>>> 乔嘉林
>>>>>> 清华大学 软件学院
>>>>>> 
>>>>>> 
>>>>>> On Tue, Feb 11, 2020 at 7:59 PM, atoiLiu <[email protected]> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Thank you for your reply.
>>>>>> I am very happy that you can take my suggestion.
>>>>>> 
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> Dawei Liu
>>>>>> 
>>>>>> 
>>>>>> On Feb 11, 2020, at 6:04 PM, Haonan Hou <[email protected]> wrote:
>>>>>> 
>>>>>> Hi Dawei,
>>>>>> 
>>>>>> Thank you so much for sharing your opinion about the new TsFile!
>>>>>> I am very happy to take your suggestions.
>>>>>> 
>>>>>> You said we can remove TsOffsetArray and directly store the offset of
>>>>>> TimeseriesMetaData. I agree with you. It is better than my version.
>>>>>> Besides, for the optimization of TimeseriesMetaData, I would like to
>>>>>> discuss with other people to determine which way is better.
>>>>>> 
>>>>>> Best,
>>>>>> 
>>>>>> Haonan Hou
>>>>>> 
>>>>>> 
>>>>>> On Feb 11, 2020, at 5:35 PM, atoiLiu <[email protected]> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I'm learning about the new TsFile in PR [1], but I think TsFileMetaData
>>>>>> has a design problem.
>>>>>> 
>>>>>> TsFileMetaData has a TsOffsetArray, which records the offset of every
>>>>>> TimeseriesMetaData, and uses a Map<deviceId, int[]> to record the
>>>>>> startIndex and endIndex into TsOffsetArray. It looks like this:
>>>>>> 
>>>>>> TsFileMetaData ---> { [0,1,2,3,4,5, …], [ {deviceId(d0), [0,2] },
>>>>>> {deviceId(d1), [3,5] }, … ] }
>>>>>> 
>>>>>> We can delete TsOffsetArray and store the offsets directly in the
>>>>>> deviceIndexArray, so that TsFileMetaData has a Map<deviceId, List<Long>>
>>>>>> to record them. This change saves 4 bytes per device on disk, because
>>>>>> each device only needs to record the number of offsets and the offsets
>>>>>> themselves. It looks like this:
>>>>>> 
>>>>>> TsFileMetaData ---> [ {deviceId(d0), [0,1,2] }, {deviceId(d1), [3,4,5] }, … ]
>>>>>> 
>>>>>> 
>>>>>> In addition, TimeSeriesMetaData is an ordered structure on the hard
>>>>>> disk, and the TimeSeriesMetaData of each device are stored next to each
>>>>>> other, so TsFileMetaData does not need to store every offset. There are
>>>>>> two optimization directions:
>>>>>> 
>>>>>> 1. Save the startTime, endTime and offset of each TimeSeriesMetaData in
>>>>>> TsFileMetaData. The nice thing about this is that when you read
>>>>>> TsFileMetaData from the hard drive, you can directly apply a filter to
>>>>>> skip the TimeSeriesMetaData that do not need to be read.
>>>>>> 
>>>>>> 
>>>>>> 2. Only save the offset of the first TimeSeriesMetaData in TsFileMetaData,
>>>>>> so that you can loop through them with a single seek. It looks like this:
>>>>>> 
>>>>>> TsFileMetaData ---> [ {deviceId(d0), 0 }, {deviceId(d1), 3 }, … ]
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> [1] https://github.com/apache/incubator-iotdb/pull/736
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> Dawei Liu
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>
