Re: Re: A new encoding method for regular timestamp column

Xiangdong Huang Fri, 28 Jun 2019 07:07:22 -0700

Hi,

+1.


The bitmap indicates which timestamps are generated (fake), and the DIFF
encoding (i.e., as Jialin said, going from 2, 3, 4, 5, 6 to 2, 1, 5)
compresses the timestamps.
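
A minimal sketch of that split, using Jialin's example. The function names
are hypothetical, the bitmap is kept as a string for readability, and the
sketch assumes strictly increasing timestamps whose gaps are all multiples
of the smallest gap:

```python
def encode_regular(timestamps):
    """Return ((first, delta, count), bitmap) for a time column with gaps."""
    if len(timestamps) < 2:
        first = timestamps[0] if timestamps else 0
        return (first, 0, len(timestamps)), "1" * len(timestamps)
    # Assumption: the smallest gap between neighbours is the regular interval.
    delta = min(b - a for a, b in zip(timestamps, timestamps[1:]))
    first = timestamps[0]
    count = (timestamps[-1] - first) // delta + 1
    present = set(timestamps)
    # '1' marks a real timestamp, '0' marks a generated (fake) one.
    bitmap = "".join(
        "1" if first + i * delta in present else "0" for i in range(count)
    )
    return (first, delta, count), bitmap


def decode_regular(header, bitmap):
    """Rebuild the original column, dropping the generated timestamps."""
    first, delta, count = header
    return [first + i * delta for i in range(count) if bitmap[i] == "1"]
```

For the column "2, 3, 4, 6" this gives the header (2, 1, 5) and the bitmap
"11101", and decoding returns the original four timestamps.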

Besides, I'd like to know when this new encoding method takes effect.
For example, "if the generated timestamps make up less than x% of the total
data points in the Page, then we benefit and the compression ratio improves
by y%".
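
To make the question concrete, here is a rough size model. It is only an
assumption of mine, not a measurement: raw timestamps cost 8 bytes each, the
encoded page costs three 8-byte values (first value, delta, count) plus one
bitmap bit per slot, and TS2DIFF is not modelled at all. The function name
is hypothetical.

```python
import math


def estimated_ratio(total_slots, generated_fraction):
    """Estimate raw size / encoded size under the assumed size model."""
    real_points = round(total_slots * (1 - generated_fraction))
    raw_bytes = 8 * real_points                # uncompressed 8-byte timestamps
    header_bytes = 3 * 8                       # first value, delta, count
    bitmap_bytes = math.ceil(total_slots / 8)  # one bit per slot
    return raw_bytes / (header_bytes + bitmap_bytes)
```

Under this model the encoded size is fixed by the number of slots, so the
ratio falls linearly as the share of generated timestamps grows. Real
numbers would have to come from the unit and performance tests.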

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院


Jialin Qiao <[email protected]> wrote on Fri, Jun 28, 2019 at 10:35 AM:

> Hi,
>
> The bitmap is still needed. I only suggest using another encoding for the
> regular time column.
>
> Suppose the time column is "2, 3, 4, 6". The generated column is "2, 3, 4,
> 5, 6".
>
> Then we could use "2 (the first value), 1 (the delta), 5 (the total number
> of points)" to encode the original data.
>
> However, when we decode "2, 1, 5" and get "2, 3, 4, 5, 6", we still need
> a mark column such as "11101" to denote that the fourth value is
> generated.
>
> Best,
> --
> Jialin Qiao
> School of Software, Tsinghua University
>
> 乔嘉林
> 清华大学 软件学院
>
> > -----Original Message-----
> > From: "Jack Tsai" <[email protected]>
> > Sent: 2019-06-28 09:54:18 (Friday)
> > To: "[email protected]" <[email protected]>
> > Cc:
> > Subject: Re: A new encoding method for regular timestamp column
> >
> > Hi Jialin,
> >
> > I am not sure if my understanding is right.
> >
> > Do you mean encoding the data without the bitmap?
> >
> > Best regards,
> > Tsung-Han Tsai
> >
> > ________________________________
> > From: Jialin Qiao <[email protected]>
> > Sent: June 28, 2019, 08:59 AM
> > To: [email protected]
> > Subject: Re: A new encoding method for regular timestamp column
> >
> > Hi Tsung-Han,
> >
> > Nice try!
> > I think by "low compression ratio of regular time series (with missing
> > points)" you mean using TS2DIFF to encode the time column.
> > I wonder: if we can "generate a new data array with no missing points
> > (regular time series)", could we just use (1) the first value, (2) the
> > delta and (3) the total number of values to encode the raw data instead
> > of using TS2DIFF?
> >
> > Best,
> > --
> > Jialin Qiao
> > School of Software, Tsinghua University
> >
> > 乔嘉林
> > 清华大学 软件学院
> >
> > > -----Original Message-----
> > > From: "Jack Tsai" <[email protected]>
> > > Sent: 2019-06-27 17:56:45 (Thursday)
> > > To: "[email protected]" <[email protected]>
> > > Cc:
> > > Subject: A new encoding method for regular timestamp column
> > >
> > > Hi all,
> > >
> > > I am working on this issue:
> > > https://issues.apache.org/jira/projects/IOTDB/issues/IOTDB-73
> > >
> > > The new encoding method deals with regular time series (the time
> > > elapsed between consecutive data points is always the same) that have
> > > missing points. When missing points exist in the data, the compression
> > > ratio drops from 40x to 8x.
> > >
> > > Here is my solution to this issue. I will explain it in two parts: the
> > > write operation and the read operation.
> > >
> > > Write (Encode)
> > >
> > >   1.  First, calculate the delta between each pair of adjacent
> > >       elements in the data. If a delta differs from the previous
> > >       delta, then missing points exist in the data.
> > >   2.  If missing points exist, take the minimum delta between adjacent
> > >       elements as the base interval, and generate a new data array
> > >       with no missing points (a regular time series).
> > >   3.  Next, begin writing the info of this data into the byte array
> > >       output stream.
> > >   4.  The first part is the identifier: a boolean value that shows
> > >       whether missing points exist in the following data.
> > >   5.  Compare the original data (with missing points) against the new
> > >       complete data to form a bitmap that denotes the positions of the
> > >       missing points.
> > >   6.  Convert the bitmap into a byte array, then write its length and
> > >       the byte array itself into the output stream.
> > >   7.  Start encoding the values of the newly created data (the array
> > >       with no missing points) into the output stream.
> > >   8.  When the encoded data size reaches the target block size, flush
> > >       it into the output stream. Repeat until all values in the data
> > >       have been flushed into the output stream.
> > >
> > > Read (Decode)
> > >
> > >   1.  First, decode the first boolean value in the buffer to check
> > >       whether the following data has missing points.
> > >   2.  If missing points exist in the data, the next part of the buffer
> > >       holds the byte array of the bitmap.
> > >   3.  To decode the bitmap, first read the next int value, which is
> > >       the length of the bitmap's byte array.
> > >   4.  Decode the byte array and convert it back into the bitmap that
> > >       denotes the missing points in the following data.
> > >   5.  Decode the following data, which is the data array without
> > >       missing points, and compare it with the bitmap. A bit of 0 means
> > >       a missing point exists at that position; a bit of 1 means the
> > >       decoded value at that position is returned (read out).
> > >
> > > In the original unit test, the compression ratio reaches up to 20x
> > > (the old encoding method gives 8x).
> > >
> > > To compare with the original encoding method more precisely, I will
> > > soon write more detailed unit tests and a performance test for this
> > > new encoding method.
> > >
> > > If anything is confusing or you see a problem in my implementation,
> > > please feel free to give me some advice.
> > >
> > > Best regards,
> > > Tsung-Han Tsai
>
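
For reference, the write and read steps in Tsung-Han's original message can
be sketched as below. This is only an illustration under my own assumptions,
not the actual IoTDB code: all names are hypothetical, the input is a
strictly increasing column with at least two points, the payload is stored
as (first value, delta, count) as Jialin suggests in place of the real value
encoder, and the block-size flushing of write step 8 is left out.

```python
import struct


def pack_bits(bits):
    """Pack a list of booleans into bytes, most significant bit first."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def write_block(timestamps):
    """Write: flag, optional (bitmap length, bitmap), then the payload."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    base = min(deltas)                        # step 2: minimum delta
    has_missing = any(d != base for d in deltas)  # step 1
    first = timestamps[0]
    count = (timestamps[-1] - first) // base + 1
    buf = bytearray(struct.pack(">?", has_missing))  # step 4: identifier
    if has_missing:
        present = set(timestamps)
        bits = [first + i * base in present for i in range(count)]  # step 5
        bitmap = pack_bits(bits)
        buf += struct.pack(">i", len(bitmap)) + bitmap  # step 6
    buf += struct.pack(">qqi", first, base, count)      # step 7 (stand-in)
    return bytes(buf)


def read_block(buf):
    """Read: mirror of write_block, returning only the real timestamps."""
    (has_missing,) = struct.unpack_from(">?", buf, 0)
    pos = 1
    bitmap = None
    if has_missing:
        (length,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        bitmap = buf[pos:pos + length]
        pos += length
    first, base, count = struct.unpack_from(">qqi", buf, pos)
    result = []
    for i in range(count):
        # A 0 bit marks a generated point, so it is skipped.
        if bitmap is None or bitmap[i // 8] & (0x80 >> (i % 8)):
            result.append(first + i * base)
    return result
```

Writing the bitmap length before the bitmap bytes matches read step 3, which
reads the int length first.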
