Hi team, I have been thinking about a problem related to SQL and stream computing, and I would like to hear your opinions.
In stream computing there is a typical case like this: we want to calculate a big, wide result table with one rowkey and ten value columns:

    create table result (
      rowkey varchar(127) PRIMARY KEY,
      col1 int,
      col2 int,
      ...
      col10 int
    );

Each value column is calculated by a complex query, so there will be ten SQL statements writing into this table. Each of them must:

* First check whether a row for the given `rowkey` already exists.
* If yes, `update` it; otherwise `insert` a new row.

There is actually a dedicated SQL syntax, `MERGE` (SQL:2008), designed for exactly this. A sample usage:

    MERGE INTO result D
    USING (SELECT rowkey, col1 FROM input WHERE flag = 80) S
    ON (D.rowkey = S.rowkey)
    WHEN MATCHED THEN UPDATE SET D.col1 = S.col1
    WHEN NOT MATCHED THEN INSERT (D.rowkey, D.col1) VALUES (S.rowkey, S.col1)

The semantics fit perfectly, but the syntax is very verbose, and ordinary users rarely use it.

So my colleagues invented a new syntax for this scenario (or, more precisely, a new interpretation of the INSERT statement). The user always writes plain `insert` statements:

    insert into result(rowkey, col1) values(...)
    insert into result(rowkey, col2) values(...)

and the SQL interpreter does a trick behind the scenes: if the `rowkey` already exists, it updates the row; otherwise it inserts one.

This solution is very concise, but it violates the semantics of `insert`: with it, INSERT behaves differently in batch and stream processing.

What do you think? Which do you prefer, and what is your reasoning? Looking forward to your opinions, thanks in advance.
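P.S. To make the intended per-column merge behavior concrete: some dialects (PostgreSQL, SQLite) already offer an explicit upsert shorthand that sits between verbose `MERGE` and the silent INSERT reinterpretation. A minimal sketch using Python's built-in sqlite3 (assuming SQLite >= 3.24 for `ON CONFLICT ... DO UPDATE`; the table is a two-column cut of the example above), showing how two single-column upserts on the same `rowkey` merge into one row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (rowkey TEXT PRIMARY KEY, col1 INT, col2 INT)")

# First "column SQL": writes col1. On key conflict, only col1 is overwritten.
conn.execute(
    "INSERT INTO result (rowkey, col1) VALUES ('k', 1) "
    "ON CONFLICT(rowkey) DO UPDATE SET col1 = excluded.col1"
)

# Second "column SQL": writes col2 for the same rowkey. The existing row is
# updated in place, so col1 from the first statement is preserved.
conn.execute(
    "INSERT INTO result (rowkey, col2) VALUES ('k', 2) "
    "ON CONFLICT(rowkey) DO UPDATE SET col2 = excluded.col2"
)

rows = conn.execute("SELECT rowkey, col1, col2 FROM result").fetchall()
print(rows)  # -> [('k', 1, 2)]: both column writers landed in one merged row
```

The upside is that the upsert intent is visible in the statement itself, so `insert` keeps its standard meaning; the downside is that it is still more verbose than a bare `insert`.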
