Hi team,

I have been thinking about a problem related to SQL and stream computing, and
I would like to hear your opinions.

In stream computing, there is a typical case like this:

We want to calculate a big wide result table, which has one rowkey and ten
value columns:

create table result (
    rowkey varchar(127) PRIMARY KEY,
    col1 int,
    col2 int,
    ...
    col10 int
);

Each value column is calculated by its own complex query, so there will be
ten SQL statements populating this table. Each statement must:

* First check whether there is a row for the specified `rowkey`.
* If yes, then `update`, otherwise `insert`.
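The check-then-update logic above can be sketched as follows (a minimal
illustration using SQLite through Python's `sqlite3`; the table follows the
example schema, reduced to two value columns):

```python
import sqlite3

# In-memory database with a reduced version of the example schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE result (
        rowkey VARCHAR(127) PRIMARY KEY,
        col1 INTEGER,
        col2 INTEGER
    )
""")

def upsert(conn, rowkey, column, value):
    """Check-then-update: UPDATE if the rowkey exists, INSERT otherwise.

    `column` is interpolated into the SQL text, so it must come from
    trusted code, never from user input.
    """
    row = conn.execute(
        "SELECT 1 FROM result WHERE rowkey = ?", (rowkey,)
    ).fetchone()
    if row is not None:
        conn.execute(f"UPDATE result SET {column} = ? WHERE rowkey = ?",
                     (value, rowkey))
    else:
        conn.execute(f"INSERT INTO result (rowkey, {column}) VALUES (?, ?)",
                     (rowkey, value))

upsert(conn, "k1", "col1", 10)   # no row yet  -> INSERT
upsert(conn, "k1", "col2", 20)   # row exists  -> UPDATE
print(conn.execute("SELECT rowkey, col1, col2 FROM result").fetchall())
# -> [('k1', 10, 20)]
```

Note the check and the write are two separate statements here, which is
exactly why a single-statement form is attractive.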

There is actually a dedicated SQL statement called `MERGE` designed for
this (standardized in SQL:2003), a sample usage is:

MERGE INTO result D
   USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
   ON (D.rowkey = S.rowkey)
   WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
   WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1) VALUES (S.rowkey, S.col1);


*The semantics fit perfectly, but the statement is very verbose, and ordinary
users rarely use it.*

So my colleagues invented a new syntax for this scenario (or, more
precisely, a new interpretation of the INSERT statement). For the scenario
above, users always write plain `insert` statements:

insert into result(rowkey, col1) values(...)
insert into result(rowkey, col2) values(...)

The SQL interpreter does a trick behind the scenes: if the `rowkey`
exists, it updates; otherwise it inserts. This solution is very concise, but
it violates the semantics of `insert`: with it, the same INSERT statement
behaves differently in batch and stream processing.
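To make the semantic difference concrete, here is a small sketch (again
SQLite through Python's `sqlite3`; SQLite's `INSERT ... ON CONFLICT` upsert
clause, available since SQLite 3.24, stands in for the reinterpreted INSERT).
Under standard semantics, inserting a duplicate rowkey is an error; under
upsert semantics, it silently updates in place:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE result (rowkey VARCHAR(127) PRIMARY KEY, col1 INTEGER)")
conn.execute("INSERT INTO result (rowkey, col1) VALUES ('k1', 1)")

# Standard INSERT semantics: a second insert of the same rowkey fails.
try:
    conn.execute("INSERT INTO result (rowkey, col1) VALUES ('k1', 2)")
except sqlite3.IntegrityError as e:
    print("plain INSERT:", e)   # UNIQUE constraint violation

# The reinterpreted INSERT behaves like an explicit upsert instead:
conn.execute("""
    INSERT INTO result (rowkey, col1) VALUES ('k1', 2)
    ON CONFLICT (rowkey) DO UPDATE SET col1 = excluded.col1
""")
print(conn.execute("SELECT col1 FROM result WHERE rowkey = 'k1'").fetchone())
# -> (2,)
```

Databases that expose this behavior tend to do it through an explicit
clause like the one above, so a plain INSERT keeps its standard meaning.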

What do you think? Which approach do you prefer, and what is your reasoning?

Looking forward to your opinions, thanks in advance.
