Hi Jialin,

I am not sure if my understanding is right.
Do you mean encoding the data without the bitmap?

Best regards,
Tsung-Han Tsai
________________________________
From: Jialin Qiao <[email protected]>
Sent: June 28, 2019, 08:59 AM
To: [email protected]
Subject: Re: A new encoding method for regular timestamp column

Hi, Tsung-Han

Nice try!

I think by "low compression ratio of regular time series (with missing
points)" you mean using TS2DIFF to encode the time column.

I wonder: if we can "generate a new data array with no missing points
(regular time series)", could we just use (1) the first value, (2) the delta
and (3) the total number of values to encode the raw data, instead of using
TS2DIFF?

Best,
--
Jialin Qiao (乔嘉林)
School of Software, Tsinghua University

> -----Original Message-----
> From: "Jack Tsai" <[email protected]>
> Sent: 2019-06-27 17:56:45 (Thursday)
> To: "[email protected]" <[email protected]>
> Cc:
> Subject: A new encoding method for regular timestamp column
>
> Hi all,
>
> I am working on this issue:
> https://issues.apache.org/jira/projects/IOTDB/issues/IOTDB-73
>
> The new encoding method deals with regular time series (the time elapsed
> between consecutive data points is always the same) that contain missing
> points. When missing points exist in the data, the compression ratio drops
> from 40x to 8x.
>
> Here is my solution to this issue. I will explain it in two parts: the
> write operation and the read operation.
>
> Write (Encode)
>
> 1. First, calculate the delta between consecutive elements in the data. If
> a delta differs from the previous delta, then missing points exist in the
> data.
> 2. If missing points exist, get the minimum delta between elements as the
> base interval. Generate a new data array with no missing points (a regular
> time series).
> 3. Next, begin to write the info of this data into the byte array output
> stream.
> 4. The first part is the identifier, a boolean value that shows whether
> missing points exist in the following data.
> 5. Compare the original data (the data with missing points) with the new
> complete data. This forms a bitmap that denotes the positions of the
> missing points.
> 6. Convert the bitmap into a byte array and write it and its length into
> the output stream.
> 7. Start to encode the data values (of the newly created array, which has
> no missing points) into the output stream.
> 8. When the encoded data size reaches the target block size, flush it into
> the output stream. Repeat until all values in the data are flushed into the
> output stream.
>
> Read (Decode)
>
> 1. First, decode the first boolean value in the buffer to check whether the
> following data has missing points.
> 2. If missing points exist in the data, then a byte array holding the
> bitmap follows in this buffer.
> 3. To decode the bitmap, first read the next int value, which is the length
> of the bitmap's byte array.
> 4. Decode the byte array and convert it into the bitmap that denotes the
> missing points in the following data.
> 5. Decode the following data, which is the data array without missing
> points, and compare it with the bitmap. A bit value of 0 means a point is
> missing at that position; when the bit value is 1, return (read out) the
> value from the decoded data.
>
> In the original unit test, the compression ratio reaches up to 20x (the old
> encoding method gives 8x).
>
> To compare with the original encoding method more precisely, I will soon
> write more detailed unit tests and a performance test for this new encoding
> method.
>
> If anything is unclear or you think there is a problem in my
> implementation, please feel free to give me some advice.
>
> Best regards,
> Tsung-Han Tsai
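
For reference, here is a minimal Python sketch of the write and read steps in the quoted proposal, combined with Jialin's suggestion of storing the filled series as (first value, delta, count). It is only an illustration under stated assumptions, not the IoTDB implementation (which is in Java): the function names `encode_regular` and `decode_regular` are hypothetical, the bitmap is packed most-significant-bit first, and the minimum delta is assumed to be the regular interval (the sketch raises an error when the gaps are not multiples of it). Block-size flushing and the byte-level stream layout are left out.

```python
from typing import List, Tuple


def encode_regular(timestamps: List[int]) -> Tuple[bool, bytes, int, int, int]:
    """Return (has_missing, bitmap, first, delta, count) for sorted timestamps."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    base = min(deltas)  # assumption: the smallest gap is the regular interval
    if base <= 0 or any(d % base for d in deltas):
        raise ValueError("timestamps do not fit a regular grid")
    first = timestamps[0]
    count = (timestamps[-1] - first) // base + 1  # points in the filled series
    has_missing = count != len(timestamps)
    bitmap = bytearray((count + 7) // 8) if has_missing else bytearray()
    if has_missing:
        for t in timestamps:
            i = (t - first) // base  # slot of this point in the filled series
            bitmap[i // 8] |= 1 << (7 - i % 8)  # 1 = present, 0 = missing
    return has_missing, bytes(bitmap), first, base, count


def decode_regular(has_missing: bool, bitmap: bytes,
                   first: int, delta: int, count: int) -> List[int]:
    """Rebuild the original timestamps, skipping slots whose bit is 0."""
    out = []
    for i in range(count):
        if not has_missing or bitmap[i // 8] & (1 << (7 - i % 8)):
            out.append(first + i * delta)
    return out


if __name__ == "__main__":
    ts = [1000, 2000, 4000, 5000, 8000]  # points at 3000, 6000, 7000 missing
    encoded = encode_regular(ts)
    assert decode_regular(*encoded) == ts
```

With this representation the encoded size depends only on the bitmap length (one bit per slot in the filled series) plus three integers, which is why the question of whether the bitmap is still needed only arises for series that actually have gaps.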
