Optimisation for time series

Nicolas DOUSSINET Mon, 09 Jan 2017 14:41:37 -0800

Hi Phoenix,

I have used Phoenix for 1 year and HBase for 2 years, and I really think Phoenix
leverages HBase well. But I'm still surprised that the column-oriented storage
isn't fully used. The dynamic column feature allows you to upsert or select a
column not declared in the CREATE TABLE statement, but you cannot create a block
of variable columns. Why hasn't a feature like the following been implemented?


CREATE TABLE (
eventTime  TIMESTAMP NOT NULL,
iotID INTEGER NOT NULL,
consumption BIGINT,
maxConsumption BIGINT
CONSTRAINT pk PRIMARY KEY (eventTime day_qualifier_column, iotID))
SALT_BUCKETS = 20;

With this, Phoenix would create an HBase table with 1 row per iotID and day, and
all the other columns stored as blocks in that same row, each qualifier carrying
a suffix for the time elapsed since the beginning of the day (in the optimized
way, like OpenTSDB). In the same way, hour_qualifier_column would create 1 row
per iotID and hour, with all the other columns stored as blocks in that same
row, each qualifier carrying a suffix for the time elapsed since the beginning
of the hour.
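
To make the proposed layout concrete, here is a minimal Python sketch of how one
measurement could be mapped to a row key and a column qualifier. It is only an
illustration of the idea, not Phoenix or OpenTSDB code: the function names, the
one-second resolution and the tuple representation of keys are all assumptions.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400    # bucket size for the proposed day_qualifier_column
SECONDS_PER_HOUR = 3_600    # bucket size for the proposed hour_qualifier_column


def split_event_time(event_time: datetime, bucket_seconds: int) -> tuple[int, int]:
    """Split a timestamp into (bucket start, offset inside the bucket), in seconds."""
    epoch_seconds = int(event_time.timestamp())
    offset = epoch_seconds % bucket_seconds
    return epoch_seconds - offset, offset


def to_cell(event_time: datetime, iot_id: int, column: str,
            bucket_seconds: int = SECONDS_PER_DAY):
    """Return (row_key, qualifier) for one measurement (hypothetical layout)."""
    bucket_start, offset = split_event_time(event_time, bucket_seconds)
    row_key = (bucket_start, iot_id)   # one row per iotID and bucket
    qualifier = (column, offset)       # column name plus time-offset suffix
    return row_key, qualifier
```

With a day bucket, every measurement of a given iotID taken on the same day
lands in the same row, and only the offset part of the qualifier changes.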

Of course, if a row holds, for example, 20 column blocks (20 x consumption and
maxConsumption), the SQL SELECT statement will return 20 lines (like LATERAL
VIEW explode in HiveQL).
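
The expected read behaviour can be sketched in Python as follows. This is only
an illustration under assumptions: the dictionary representation of a wide row
and the name explode_row are hypothetical, and the row key is taken to be
(bucket start in epoch seconds, iotID).

```python
def explode_row(row_key, cells):
    """Turn one wide row into one logical record per time offset.

    cells maps (column_name, offset_seconds) -> value.
    """
    bucket_start, iot_id = row_key
    by_offset = {}
    # Group the cells of each column block by their time-offset suffix.
    for (column, offset), value in cells.items():
        by_offset.setdefault(offset, {})[column] = value
    # Emit one record per offset, rebuilding the full event time.
    return [
        {"eventTime": bucket_start + offset, "iotID": iot_id, **columns}
        for offset, columns in sorted(by_offset.items())
    ]
```

A row holding 20 column blocks would therefore come back as 20 records, each
with its own eventTime, consumption and maxConsumption.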

It would be something like the OpenTSDB model:
http://opentsdb.net/docs/build/html/user_guide/backends/hbase.html#data-table-schema

This could probably be optimized further with a data block encoding such as
FAST_DIFF, because only the suffix changes across a range of column qualifiers
in HBase, and I think the native aggregation coprocessor would work on it. (I
think the future immutable data packing feature will not use the native
coprocessor aggregation.)

I think this would improve the performance of SQL analysis (like OLAP on time
series).

This is for time series use cases.

You could say that I could model the data in rows rather than in columns, but
if I use SALT_BUCKETS, FAST_DIFF encoding won't be optimal (because of the
datetime in the row key).
You could say that I could use a row timestamp, but what about huge time series
(few row keys and many versions)?

=> We have tried row-based vs. column-based modelling (static declaration in
Phoenix, 48 time slots), and for significant aggregations the column-based
layout was 2x better.

Thank you in advance for your answer.

Best regards,

Nicolas DOUSSINET
