Hi,

Sorry for sending an unfinished mail last time; this is the complete version.

According to the results of the performance tests I have run so far, there are 
some differences between the new encoding method I created and the original 
encoding method.

I increased the data size to 2000000 rows when testing the new encoding method 
(the data size in the original test of the old encoding method was around 
3000). The comparison covers the following aspects:


  1.  Compression ratio
  2.  Write (encode) operation time
  3.  Read (decode) operation time
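
For reference, the compression ratio reported below is the source data size 
divided by the encoded data size; for example, in the first case below, 
16796160 / 2994679 ≈ 5.61.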

The following are the test results for the original encoding method:

  *   Missing a data point every 10 points (10% missing point percentage)

Row number: 2332801
Missing point percentage: 0.10000038580230375
Source data size: 16796160 bytes
Write time: 296
Encoded data size: 2994679 bytes
Read time: 173
Compression ratio: 5.6086679073116015


  *   Missing a data point every 20 points (5% missing point percentage)

Row number: 2332801
Missing point percentage: 0.05000040723576504
Source data size: 17729280 bytes
Write time: 287
Encoded data size: 3161045 bytes
Read time: 195
Compression ratio: 5.608676877425029


  *   Missing a data point every 80 points (1.25% missing point percentage)

Row number: 2332801
Missing point percentage: 0.012500423310861097
Source data size: 18429120 bytes
Write time: 291
Encoded data size: 3285820 bytes
Read time: 189
Compression ratio: 5.608682155443694

  *   Missing a data point every 1700 points (about 0.06% missing point percentage)

Row number: 2332801
Missing point percentage: 5.885628478382587E-4
Source data size: 18651424 bytes
Write time: 199
Encoded data size: 651696 bytes
Read time: 94
Compression ratio: 28.619822739436792

  *   Missing a data point every 40000 points (0.0025% missing point percentage)

Row number: 2332801
Missing point percentage: 2.5291484357259364E-5
Source data size: 18661936 bytes
Write time: 232
Encoded data size: 443136 bytes
Read time: 69
Compression ratio: 42.11333766608897



The test results for the new encoding method are as follows:

  *   Missing a data point every 10 points (10% missing point percentage)

Row number: 2332801
Missing point percentage: 0.10000038580230375
Source data size: 16796160 bytes
Write time: 117
Encoded data size: 653285 bytes
Read time: 84
Compression ratio: 25.71031020152001

  *   Missing a data point every 20 points (5% missing point percentage)

Row number: 2332801
Missing point percentage: 0.05000040723576504
Source data size: 17729280 bytes
Write time: 171
Encoded data size: 653285 bytes
Read time: 116
Compression ratio: 27.138660768271123

  *   Missing a data point every 80 points (1.25% missing point percentage)

Row number: 2332801
Missing point percentage: 0.012500423310861097
Source data size: 18429120 bytes
Write time: 128
Encoded data size: 653285 bytes
Read time: 89
Compression ratio: 28.209923693334456

  *   Missing a data point every 1700 points (about 0.06% missing point percentage)

Row number: 2332801
Missing point percentage: 5.885628478382587E-4
Source data size: 18651424 bytes
Write time: 118
Encoded data size: 653285 bytes
Read time: 161
Compression ratio: 28.550210092073137

  *   Missing a data point every 40000 points (0.0025% missing point percentage)

Row number: 2332801
Missing point percentage: 0.002142488793514752
Source data size: 18622424 bytes
Write time: 110
Encoded data size: 653286 bytes
Read time: 81
Compression ratio: 28.50577541842317

As shown above, the compression ratio increases as the missing point 
percentage decreases.

At a high missing point percentage, the new encoding method has better 
compression performance.

When the missing point percentage comes down to about 0.06%, the compression 
performance of the two encoding methods is almost the same.

However, as the missing point percentage keeps decreasing, the compression 
ratio of the new encoding method barely increases any further. In contrast, 
the performance of the original encoding method keeps improving as the missing 
point percentage decreases.

In the last comparison of the compression ratio, the original encoding method 
reaches over 40x, which is almost the same performance as encoding data 
without missing points, while the new encoding method stays at around 28x.

The write and read times also differ between the two encoding methods. The new 
encoding method improves the write (encode) performance.

In conclusion, the new encoding method is better suited when the missing point 
percentage is high (above 1%), while at a lower missing point percentage 
(below 1%) the original encoding method may achieve better compression.

Given these results, do I need to make any changes to my implementation of the 
new encoding method? Do the results meet what this issue needs?

If you see any problems with my testing or implementation, please feel free to 
give me advice.

Best regards,
Tsung-Han Tsai
________________________________
From: Jack Tsai
Sent: July 3, 2019, 3:58 PM
To: [email protected]
Subject: Re: Re: A new encoding method for regular timestamp column

Hi,

According to the results of the performance tests I have run so far, there are 
some differences between the new encoding method I created and the original 
encoding method.

I increased the data size to 2000000 rows when testing the new encoding method 
(the data size in the original test of the old encoding method was around 
3000). The comparison covers the following aspects:


  1.  Compression ratio
  2.  Write (encode) operation time
  3.  Read (decode) operation time

I found that in the original encoding method, the determining factor for 
compression performance is not the number of missing points in the data but 
their density. I will provide some examples to illustrate this.


  *   Missing a data point every 500 points

________________________________
From: Jialin Qiao <[email protected]>
Sent: June 28, 2019, 10:35 AM
To: [email protected]
Subject: Re: Re: A new encoding method for regular timestamp column

Hi,

The bitmap is also needed. Just use another encoding for the regular time column.

Suppose the time column is "2, 3, 4, 6". The generated column is "2, 3, 4, 5, 
6".

Then we could use "2 (the first value), 1 (the delta), 5 (the total data 
number)" to encode the original data.

However, when decoding "2, 1, 5" and getting "2, 3, 4, 5, 6", we still need a 
mark column such as "11101" to denote that the fourth value is generated.
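
A minimal sketch of this idea in Java (the class and variable names here are 
illustrative, not IoTDB's actual API):

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class RegularTimeSketch {
    public static void main(String[] args) {
        long[] time = {2, 3, 4, 6};
        long first = time[0];
        long delta = 1;                 // the fixed interval of the regular series
        int count = (int) ((time[time.length - 1] - first) / delta) + 1;  // 5

        // Mark column "11101": bit i is set when the value first + i * delta exists.
        BitSet mark = new BitSet(count);
        for (long t : time) {
            mark.set((int) ((t - first) / delta));
        }

        // Decode: walk the generated column "2, 3, 4, 5, 6", keep marked positions.
        List<Long> decoded = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            if (mark.get(i)) {
                decoded.add(first + i * delta);
            }
        }
        System.out.println(decoded);    // prints [2, 3, 4, 6]
    }
}

So the stream only needs to store "2, 1, 5" plus the mark column, instead of a 
full delta-encoded series.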

Best,
--
Jialin Qiao
School of Software, Tsinghua University


> -----Original Message-----
> From: "Jack Tsai" <[email protected]>
> Sent: 2019-06-28 09:54:18 (Friday)
> To: "[email protected]" <[email protected]>
> Cc:
> Subject: Re: A new encoding method for regular timestamp column
>
> Hi Jialin,
>
> I am not sure if my understanding is right.
>
> Does it mean encoding the data without the bitmap?
>
> Best regards,
> Tsung-Han Tsai
>
> ________________________________
> From: Jialin Qiao <[email protected]>
> Sent: June 28, 2019, 8:59 AM
> To: [email protected]
> Subject: Re: A new encoding method for regular timestamp column
>
> Hi, Tsung-Han
>
> Nice try!
> I think by "low compression ratio of regular time series (with missing 
> points)" you mean using TS2DIFF to encode the time column.
> I wonder, if we could "generate a new data array with no missing points 
> (regular time series)",
> could we just use (1) the first value, (2) the delta, and (3) the total 
> number of values to encode the raw data instead of using TS2DIFF?
>
> Best,
> --
> Jialin Qiao
> School of Software, Tsinghua University
>
>
> > -----Original Message-----
> > From: "Jack Tsai" <[email protected]>
> > Sent: 2019-06-27 17:56:45 (Thursday)
> > To: "[email protected]" <[email protected]>
> > Cc:
> > Subject: A new encoding method for regular timestamp column
> >
> > Hi all,
> >
> > I am working on this issue: 
> > https://issues.apache.org/jira/projects/IOTDB/issues/IOTDB-73
> >
> > The new encoding method deals with the problem of regular time series 
> > (where the time elapsed between consecutive data points is always the 
> > same) that have missing points. When missing points exist in the data, the 
> > compression ratio drops from 40x to 8x.
> >
> > To solve this issue, here is my solution. I will explain it in two parts: 
> > the write operation and the read operation.
> >
> > Write (Encode)
> >
> >   1.  First, calculate the delta between consecutive elements in the 
> > data. If a delta differs from the previous one, missing points exist in 
> > the data.
> >   2.  If missing points exist, take the minimum delta between elements as 
> > the base and generate a new data array with no missing points (a regular 
> > time series).
> >   3.  Next, write the info of this data into the byte array output stream.
> >   4.  The first part is the identifier, a boolean value showing whether 
> > missing points exist in the following data.
> >   5.  Compare the original data (the one with missing points) with the 
> > new complete data to form a bitmap denoting the positions of the missing 
> > points.
> >   6.  Convert the bitmap into a byte array and write it, together with 
> > its length, into the output stream.
> >   7.  Encode the values of the newly created data (which has no missing 
> > points) into the output stream.
> >   8.  When the encoded data size reaches the target block size, flush it 
> > into the output stream. Repeat until all values in the data are flushed 
> > into the output stream (see the sketch after this list).
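> >
> > A minimal sketch of these write steps in Java (a simplified illustration 
> > under assumptions; the class and method names are made up, not the actual 
> > implementation):
> >
> > import java.io.DataOutputStream;
> > import java.io.IOException;
> > import java.util.BitSet;
> >
> > class RegularEncoderSketch {
> >     static void encode(long[] time, DataOutputStream out) throws IOException {
> >         // Steps 1-2: the minimum delta is the base interval; if the span
> >         // implies more points than the array holds, points are missing.
> >         long delta = Long.MAX_VALUE;
> >         for (int i = 1; i < time.length; i++) {
> >             delta = Math.min(delta, time[i] - time[i - 1]);
> >         }
> >         long first = time[0];
> >         int count = (int) ((time[time.length - 1] - first) / delta) + 1;
> >         boolean hasMissing = count != time.length;
> >
> >         out.writeBoolean(hasMissing);              // step 4: identifier
> >
> >         if (hasMissing) {
> >             BitSet bitmap = new BitSet(count);     // step 5: set bit = point present
> >             for (long t : time) {
> >                 bitmap.set((int) ((t - first) / delta));
> >             }
> >             byte[] bytes = bitmap.toByteArray();
> >             out.writeInt(bytes.length);            // step 6: length, then bitmap bytes
> >             out.write(bytes);
> >         }
> >
> >         for (int i = 0; i < count; i++) {          // steps 7-8: write the regular
> >             out.writeLong(first + i * delta);      // series (block flushing omitted)
> >         }
> >     }
> > }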
> >
> > Read (Decode)
> >
> >   1.  First of all, decode the first boolean value in the buffer to check 
> > whether the following data has the missing point.
> >   2.  If the missing point exist in the data, then it means there is a byte 
> > array of bitmap in the following part of this buffer.
> >   3.  To decode the bitmap, initially, read the next int value which is for 
> > the length of the following byte array of the bitmap.
> >   4.  Decode the byte array of the bitmap and convert it into the bitmap 
> > for denoting the missing point in the following data.
> >   5.  Decode the following data, which is the data array without missing 
> > points and compare it with the bitmap (according to the mechanism of the 
> > bitmap, when the bit comes to 0, then it means the missing point exists 
> > here. Return (read out) the value in the decoded data when the bit value is 
> > 1).
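> >
> > A matching sketch of the read steps (also simplified; it assumes the 
> > layout written by the encode sketch above, and that the total value count 
> > is known from elsewhere, e.g. block metadata):
> >
> > import java.io.DataInputStream;
> > import java.io.IOException;
> > import java.util.ArrayList;
> > import java.util.BitSet;
> > import java.util.List;
> >
> > class RegularDecoderSketch {
> >     static List<Long> decode(DataInputStream in, int count) throws IOException {
> >         boolean hasMissing = in.readBoolean();     // step 1: identifier
> >         BitSet bitmap = null;
> >         if (hasMissing) {
> >             int len = in.readInt();                // step 3: bitmap byte length
> >             byte[] bytes = new byte[len];
> >             in.readFully(bytes);
> >             bitmap = BitSet.valueOf(bytes);        // step 4: rebuild the bitmap
> >         }
> >         List<Long> values = new ArrayList<>();
> >         for (int i = 0; i < count; i++) {          // step 5: read the complete series
> >             long value = in.readLong();
> >             if (bitmap == null || bitmap.get(i)) { // a 0 bit marks a generated
> >                 values.add(value);                 // (missing) point, so skip it
> >             }
> >         }
> >         return values;
> >     }
> > }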
> >
> > In the original unit test, the compression ratio reaches up to 20x (the 
> > old encoding method results in 8x).
> >
> > To compare with the original encoding method more precisely, I will soon 
> > run more detailed unit tests and a performance test for this new encoding 
> > method.
> >
> > If anything is unclear or you think there is a problem in my 
> > implementation, please feel free to give me advice.
> >
> > Best regards,
> > Tsung-Han Tsai
