This is an automated email from the ASF dual-hosted git repository.
guoyp pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/griffin.git
The following commit(s) were added to refs/heads/master by this push:
new 73b76e0 update documentation
73b76e0 is described below
commit 73b76e03a1b00fa11b168f055160e0f82a3d34de
Author: iyuriysoft <[email protected]>
AuthorDate: Tue Jan 29 07:02:49 2019 +0800
update documentation
added extra information
Author: iyuriysoft <[email protected]>
Closes #479 from iyuriysoft/patch-4.
---
griffin-doc/measure/dsl-guide.md | 14 ++++++++++++++
griffin-doc/measure/measure-configuration-guide.md | 20 ++++++++++++++++++++
2 files changed, 34 insertions(+)
diff --git a/griffin-doc/measure/dsl-guide.md b/griffin-doc/measure/dsl-guide.md
index 5296176..b9a4b8c 100644
--- a/griffin-doc/measure/dsl-guide.md
+++ b/griffin-doc/measure/dsl-guide.md
@@ -127,6 +127,14 @@ Profiling rule expression in Apache Griffin DSL is a sql-like expression, with s
Distinctness rule expression in Apache Griffin DSL is a list of selection expressions separated by comma, indicating the columns to check for distinctness.
e.g. `name, age`, `name, (age + 1) as next_age`
+### Uniqueness Rule
+Uniqueness rule expression in Apache Griffin DSL is a list of selection expressions separated by comma, indicating the columns to check for uniqueness. Uniqueness means the items contain no duplicated data.
+ e.g. `name, age`, `name, (age + 1) as next_age`
+
+### Completeness Rule
+Completeness rule expression in Apache Griffin DSL is a list of selection expressions separated by comma, indicating the columns to check for null values.
+ e.g. `name, age`, `name, (age + 1) as next_age`
+
### Timeliness Rule
Timeliness rule expression in Apache Griffin DSL is a list of selection expressions separated by comma, indicating the input time and output time (the calculation time is used as default if the output time is not set).
e.g. `ts`, `ts, end_ts`
@@ -167,6 +175,12 @@ For example, the dsl rule is `name, age`, which represents the distinct requests
After the translation, the metrics will be persisted in table `distinct_metric` and `dup_metric`.
+### Completeness
+Completeness checks for null values: the columns you measure are incomplete if they are null.
+- **total count of source**: `SELECT COUNT(*) AS total FROM source`, saved as table `total_count`.
+- **incomplete metric**: `SELECT COUNT(*) AS incomplete FROM source WHERE NOT (id IS NOT NULL)`, saved as table `incomplete_count`.
+- **complete metric**: `SELECT (source.total - incomplete_count.incomplete) AS complete FROM source LEFT JOIN incomplete_count`, saved as table `complete_count`.
+
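The three completeness steps above can be sketched in plain Python; this is an illustrative approximation of the SQL semantics, not Griffin's Spark code, and the sample rows and the nullable `id` column are hypothetical:

```python
# Illustrative sketch of the completeness translation (not Griffin's code):
# emulate the three SQL steps over hypothetical rows with a nullable "id".
rows = [{"id": 1}, {"id": 2}, {"id": None}, {"id": 4}]

# SELECT COUNT(*) AS total FROM source
total = len(rows)

# SELECT COUNT(*) AS incomplete FROM source WHERE NOT (id IS NOT NULL)
incomplete = sum(1 for r in rows if r["id"] is None)

# SELECT (total - incomplete) AS complete
complete = total - incomplete

print(total, incomplete, complete)  # 4 1 3
```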
### Timeliness
Timeliness measures the latency of each item and gets the statistics of the latencies.
For example, if the dsl rule is `ts, out_ts`, the first column is the input time of the item and the second column is the output time; if the output time is not set, `__tmst` will be the default output time column. After the translation, the sql rule is as below:
diff --git a/griffin-doc/measure/measure-configuration-guide.md b/griffin-doc/measure/measure-configuration-guide.md
index 2522ee4..1013ae6 100644
--- a/griffin-doc/measure/measure-configuration-guide.md
+++ b/griffin-doc/measure/measure-configuration-guide.md
@@ -238,10 +238,30 @@ Above lists DQ job configure parameters.
* num: the duplicate number name in metric, optional.
* duplication.array: optional, if set as a non-empty string, the duplication metric will be computed, and the group metric name is this string.
* with.accumulate: optional, default is true, if set as false, in streaming mode, the data set will not compare with old data to check distinctness.
+  + uniqueness dq type detail configuration
+    * source: name of data source to measure uniqueness.
+    * target: name of data source to compare with. It is always the same as source, or a superset of source.
+    * unique: the unique count name in metric, optional.
+    * total: the total count name in metric, optional.
+    * dup: the duplicate count name in metric, optional.
+    * num: the duplicate number name in metric, optional.
+    * duplication.array: optional, if set as a non-empty string, the duplication metric will be computed, and the group metric name is this string.
+  + completeness dq type detail configuration
+    * source: name of data source to measure completeness.
+    * total: the total count name in metric, optional.
+    * complete: the column name in metric, optional. The number of not-null values.
+    * incomplete: the column name in metric, optional. The number of null values.
+ timeliness dq type detail configuration
* source: name of data source to measure timeliness.
* latency: the latency column name in metric, optional.
+    * total: column name, optional.
+    * avg: column name, optional. The average latency.
+    * step: column name, optional. The histogram bin, computed as step = floor(latency / step.size).
+    * count: column name, optional. The number of latencies falling into each step.
+    * percentile: column name, optional.
* threshold: optional, if set as a time string like "1h", the items with latency more than 1 hour will be recorded.
+    * step.size: optional, used to build the histogram of latencies, in milliseconds (e.g. "100").
+    * percentile.values: optional, used to compute the percentile metrics, with values between 0 and 1. For instance, if set to [0.1, 0.9], we can see the fastest and slowest latencies.
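The timeliness metrics above (total, avg, step histogram, percentiles) can be sketched in plain Python; this is an illustrative approximation, not Griffin's implementation, and the sample latencies, step size, and nearest-rank percentile choice are assumptions:

```python
import math

# Illustrative sketch of the timeliness metrics (not Griffin's code).
latencies = [120, 250, 90, 400, 310]  # hypothetical latencies in milliseconds

total = len(latencies)                # "total" metric
avg = sum(latencies) / total          # "avg" metric

# Histogram: each latency falls into bin step = floor(latency / step.size).
step_size = 100                       # corresponds to step.size = "100"
steps = {}
for latency in latencies:
    step = math.floor(latency / step_size)
    steps[step] = steps.get(step, 0) + 1  # "count" of latencies in each step

# percentile.values, e.g. [0.1, 0.9]: a simple nearest-rank percentile.
def percentile(values, p):
    ordered = sorted(values)
    idx = min(int(p * len(ordered)), len(ordered) - 1)
    return ordered[idx]

print(total, avg)                                              # 5 234.0
print(percentile(latencies, 0.1), percentile(latencies, 0.9))  # 90 400
```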
- **cache**: Cache output dataframe. Optional, valid only for "spark-sql" and "df-ops" mode. Defaults to `false` if not specified.
- **out**: List of output sinks for the job.
+ Metric output.