update dsl-guide and set coalesce as null string by default
Project: http://git-wip-us.apache.org/repos/asf/incubator-griffin/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-griffin/commit/e9229154 Tree: http://git-wip-us.apache.org/repos/asf/incubator-griffin/tree/e9229154 Diff: http://git-wip-us.apache.org/repos/asf/incubator-griffin/diff/e9229154 Branch: refs/heads/griffin-0.2.0-incubating-rc4 Commit: e9229154832f9802bdffe49039df3f70becf1bc6 Parents: 0ba7e07 Author: Lionel Liu <[email protected]> Authored: Tue Apr 24 17:56:27 2018 +0800 Committer: Lionel Liu <[email protected]> Committed: Tue Apr 24 17:56:27 2018 +0800 ---------------------------------------------------------------------- griffin-doc/measure/dsl-guide.md | 26 +++++++++++++++++--- .../measure/rule/dsl/expr/SelectExpr.scala | 2 +- 2 files changed, 23 insertions(+), 5 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-griffin/blob/e9229154/griffin-doc/measure/dsl-guide.md ---------------------------------------------------------------------- diff --git a/griffin-doc/measure/dsl-guide.md b/griffin-doc/measure/dsl-guide.md index 0fc8059..7370274 100644 --- a/griffin-doc/measure/dsl-guide.md +++ b/griffin-doc/measure/dsl-guide.md @@ -125,6 +125,10 @@ Profiling rule expression in Griffin DSL is a sql-like expression, with select c Uniqueness rule expression in Griffin DSL is a list of selection expressions separated by comma, indicates the columns to check if is unique. e.g. `name, age`, `name, (age + 1) as next_age` +### Distinctness Rule +Distinctness rule expression in Griffin DSL is a list of selection expressions separated by comma, indicates the columns to check if is distinct. + e.g. `name, age`, `name, (age + 1) as next_age` + ### Timeliness Rule Timeliness rule expression in Griffin DSL is a list of selection expressions separated by comma, indicates the input time and output time (calculate time as default if not set). e.g. `ts`, `ts, end_ts` @@ -149,12 +153,12 @@ For profiling, the request is always the aggregation function of data, the rule For example, the dsl rule is `source.cntry, source.id.count(), source.age.max() group by source.cntry`, which represents the profiling requests. After the translation, the sql rule is as below: - **profiling sql rule**: `SELECT source.cntry, count(source.id), max(source.age) FROM source GROUP BY source.cntry`, save as table `profiling`. -After the translation, the metrics will be persisted in table `profiling`. +After the translation, the metrics will be persisted in table `profiling`. ### Uniqueness -For uniqueness, or called duplicate, is to find out the duplicate items of data, and rollup the items count group by duplicate times. -For example, the dsl rule is `name, age`, which represents the duplicate requests, in this case, source and target are the same data set. After the translation, the sql rule is as below: -- **get distinct items from source**: `SELECT name, age FROM source`, save as table `src`. +For uniqueness, or called duplicate, is to find out the duplicate items of data, and rollup the items count group by duplicate times. +For example, the dsl rule is `name, age`, which represents the duplicate requests, in this case, source and target are the same data set. After the translation, the sql rule is as below: +- **get distinct items from source**: `SELECT name, age FROM source`, save as table `src`. - **get all items from target**: `SELECT name, age FROM target`, save as table `tgt`. - **join two tables**: `SELECT src.name, src.age FROM tgt RIGHT JOIN src ON coalesce(src.name, '') = coalesce(tgt.name, '') AND coalesce(src.age, '') = coalesce(tgt.age, '')`, save as table `joined`. - **get items duplication**: `SELECT name, age, (count(*) - 1) AS dup FROM joined GROUP BY name, age`, save as table `grouped`. @@ -166,6 +170,20 @@ For example, the dsl rule is `name, age`, which represents the duplicate request After the translation, the metrics will be persisted in table `dup_metric`. +### Distinctness +For distinctness, is to find out the duplicate items of data, the same as uniqueness in batch mode, but with some differences in streaming mode. +In most time, you need distinctness other than uniqueness. +For example, the dsl rule is `name, age`, which represents the distinct requests, in this case, source and target are the same data set. After the translation, the sql rule is as below: +- **total count of source**: `SELECT COUNT(*) AS total FROM source`, save as table `total_count`. +- **group by fields**: `SELECT name, age, (COUNT(*) - 1) AS dup, TRUE AS dist FROM source GROUP BY name, age`, save as table `dup_count`. +- **distinct metric**: `SELECT COUNT(*) AS dist_count FROM dup_count WHERE dist`, save as table `distinct_metric`. +- **source join distinct metric**: `SELECT source.*, dup_count.dup AS dup, dup_count.dist AS dist FROM source LEFT JOIN dup_count ON source.name = dup_count.name AND source.age = dup_count.age`, save as table `dist_joined`. +- **add row number**: `SELECT *, ROW_NUMBER() OVER (DISTRIBUTE BY name, age SORT BY dist) row_num FROM dist_joined`, save as table `row_numbered`. +- **duplicate records**: `SELECT name, age, dup FROM row_numbered WHERE NOT dist OR row_num > 1`, save as table `dup_records`. +- **duplicate metric**: `SELECT name, age, dup, COUNT(*) AS num FROM dup_records GROUP BY name, age, dup`, save as table `dup_metric`. + +After the translation, the metrics will be persisted in table `distinct_metric` and `dup_metric`. + ### Timeliness For timeliness, is to measure the latency of each item, and get the statistics of the latencies. For example, the dsl rule is `ts, out_ts`, the first column means the input time of item, the second column means the output time of item, if not set, `__tmst` will be the default output time column. After the translation, the sql rule is as below: http://git-wip-us.apache.org/repos/asf/incubator-griffin/blob/e9229154/measure/src/main/scala/org/apache/griffin/measure/rule/dsl/expr/SelectExpr.scala ---------------------------------------------------------------------- diff --git a/measure/src/main/scala/org/apache/griffin/measure/rule/dsl/expr/SelectExpr.scala b/measure/src/main/scala/org/apache/griffin/measure/rule/dsl/expr/SelectExpr.scala index d6e350b..0c3c451 100644 --- a/measure/src/main/scala/org/apache/griffin/measure/rule/dsl/expr/SelectExpr.scala +++ b/measure/src/main/scala/org/apache/griffin/measure/rule/dsl/expr/SelectExpr.scala @@ -115,7 +115,7 @@ case class SelectionExpr(head: HeadExpr, selectors: Seq[SelectExpr], aliasOpt: O selectors.lastOption match { case None => desc case Some(sel: FunctionSelectExpr) => desc - case _ => s"coalesce(${desc}, 'null')" + case _ => s"coalesce(${desc}, '')" } } def alias: Option[String] = {
