Hi, +1.

The bitmap indicates which timestamps are generated (fake), and the DIFF
encoding (i.e., what Jialin described: turning 2, 3, 4, 5, 6 into 2, 1, 5)
compresses the timestamps.

Besides, I'd like to know when this new encoding method pays off. For
example: "if the generated timestamps are fewer than x% of the total data
points in the Page, then we benefit and the compression ratio improves by
y%".

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

黄向东
清华大学 软件学院


Jialin Qiao <[email protected]> wrote on Fri, Jun 28, 2019 at 10:35 AM:

> Hi,
>
> The bitmap is also needed. Just use another encoding for the regular time
> column.
>
> Suppose the time column is "2, 3, 4, 6". The generated column is
> "2, 3, 4, 5, 6".
>
> Then we could use "2 (the first value), 1 (the delta), 5 (the total
> number of values)" to encode the original data.
>
> However, when decoding "2, 1, 5" back into "2, 3, 4, 5, 6", we still
> need a mark column such as "11101" to denote that the fourth value is
> generated.
>
> Best,
> --
> Jialin Qiao
> School of Software, Tsinghua University
>
> 乔嘉林
> 清华大学 软件学院
>
> > -----Original Message-----
> > From: "Jack Tsai" <[email protected]>
> > Sent: 2019-06-28 09:54:18 (Friday)
> > To: "[email protected]" <[email protected]>
> > Cc:
> > Subject: Re: A new encoding method for regular timestamp column
> >
> > Hi Jialin,
> >
> > I am not sure if my understanding is right.
> >
> > Does it mean encoding the data without the bitmap?
> >
> > Best regards,
> > Tsung-Han Tsai
> >
> > ________________________________
> > From: Jialin Qiao <[email protected]>
> > Sent: June 28, 2019, 08:59 AM
> > To: [email protected]
> > Subject: Re: A new encoding method for regular timestamp column
> >
> > Hi, Tsung-Han
> >
> > Nice try!
> > I think by "low compression ratio of regular time series (with missing
> > points)" you mean using TS2DIFF to encode the time column.
> > I wonder, if we could "generate a new data array with no missing points
> > (regular time series)", could we just use (1) the first value, (2) the
> > delta, and (3) the total number of values to encode the raw data instead
> > of using TS2DIFF?
> >
> > Best,
> > --
> > Jialin Qiao
> > School of Software, Tsinghua University
> >
> > 乔嘉林
> > 清华大学 软件学院
> >
> > > -----Original Message-----
> > > From: "Jack Tsai" <[email protected]>
> > > Sent: 2019-06-27 17:56:45 (Thursday)
> > > To: "[email protected]" <[email protected]>
> > > Cc:
> > > Subject: A new encoding method for regular timestamp column
> > >
> > > Hi all,
> > >
> > > I am working on this issue:
> > > https://issues.apache.org/jira/projects/IOTDB/issues/IOTDB-73
> > >
> > > The new encoding method deals with regular time series (the time
> > > elapsed between consecutive data points is always the same) that have
> > > missing points. When missing points exist in the data, the compression
> > > ratio drops from 40x to 8x.
> > >
> > > Here is my solution. I will explain it in two parts: the write
> > > operation and the read operation.
> > >
> > > Write (Encode)
> > >
> > > 1. First, calculate the delta between each pair of adjacent elements
> > >    in the data. If a delta differs from the previous delta, missing
> > >    points exist in the data.
> > > 2. If missing points exist, get the minimum delta base between
> > >    adjacent elements. Generate a new data array with no missing points
> > >    (a regular time series).
> > > 3. Next, begin writing the info for this data into the byte array
> > >    output stream.
> > > 4. The first part is the identifier, a boolean value that shows
> > >    whether missing points exist in the following data.
> > > 5. Compare the original data (with missing points) against the new
> > >    complete data to form a bitmap that denotes the positions of the
> > >    missing points.
> > > 6. Convert the bitmap into a byte array and write it and its length
> > >    into the output stream.
> > > 7. Start encoding the data values (the newly created array with no
> > >    missing points) into the output stream.
> > > 8. When the encoded data size reaches the target block size, flush it
> > >    into the output stream. Repeat until all values in the data are
> > >    flushed into the output stream.
> > >
> > > Read (Decode)
> > >
> > > 1. First, decode the first boolean value in the buffer to check
> > >    whether the following data has missing points.
> > > 2. If missing points exist in the data, a byte array holding the
> > >    bitmap follows in this buffer.
> > > 3. To decode the bitmap, first read the next int value, which is the
> > >    length of the bitmap's byte array.
> > > 4. Decode the byte array and convert it into the bitmap that denotes
> > >    the missing points in the following data.
> > > 5. Decode the following data, which is the data array without missing
> > >    points, and compare it with the bitmap. A bit of 0 means a missing
> > >    point exists at that position; return (read out) the decoded value
> > >    only when the bit is 1.
> > >
> > > In the original unit test, the compression ratio reaches up to 20x
> > > (the old encoding method gives 8x).
> > >
> > > To compare with the original encoding method more precisely, I will
> > > soon write a more detailed unit test and a performance test for this
> > > new encoding method.
> > >
> > > If anything is unclear, or you see a problem in my implementation, you
> > > are welcome to give me some advice.
> > >
> > > Best regards,
> > > Tsung-Han Tsai
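
For readers following the thread, the write and read steps in the proposal, combined with Jialin's "first value, delta, total count" suggestion, could be sketched roughly as below. This is only an illustration under stated assumptions, not the IoTDB implementation: the names `encode_regular` and `decode_regular` are hypothetical, the result is a plain Python tuple rather than a byte stream, block flushing is omitted, and the minimum delta is assumed to be the true sampling interval.

```python
from typing import List, Tuple

# Hypothetical container, not an IoTDB format:
# (has_missing, bitmap bytes, first value, delta, total count after filling gaps)
Encoded = Tuple[bool, bytes, int, int, int]


def encode_regular(timestamps: List[int]) -> Encoded:
    """Encode a strictly increasing timestamp list that lies on a regular grid."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    base = min(deltas)  # assumption: the smallest delta is the sampling interval
    if base <= 0 or any(d % base for d in deltas):
        raise ValueError("timestamps are not on a regular grid")

    first = timestamps[0]
    total = (timestamps[-1] - first) // base + 1  # point count once gaps are filled
    has_missing = total != len(timestamps)

    bitmap = bytearray((total + 7) // 8) if has_missing else bytearray()
    if has_missing:
        for t in timestamps:
            i = (t - first) // base
            bitmap[i // 8] |= 0x80 >> (i % 8)  # bit 1 = real point, MSB first
    return has_missing, bytes(bitmap), first, base, total


def decode_regular(encoded: Encoded) -> List[int]:
    """Rebuild the original timestamps, skipping positions whose bit is 0."""
    has_missing, bitmap, first, base, total = encoded
    out = []
    for i in range(total):
        if has_missing and not bitmap[i // 8] & (0x80 >> (i % 8)):
            continue  # generated (fake) point, not part of the original data
        out.append(first + i * base)
    return out
```

For the time column 2, 3, 4, 6 this sketch produces first value 2, delta 1, total count 5, and a bitmap whose first five bits are 11101, which matches Jialin's example. When no points are missing, the bitmap is empty and only the three integers are kept.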
