This is an automated email from the ASF dual-hosted git repository.
guoyp pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/griffin.git
The following commit(s) were added to refs/heads/master by this push:
new 73b76e0 update documentation
73b76e0 is described below
commit 73b76e03a1b00fa11b168f055160e0f82a3d34de
Author: iyuriysoft <[email protected]>
AuthorDate: Tue Jan 29 07:02:49 2019 +0800
update documentation
added extra information
Author: iyuriysoft <[email protected]>
Closes #479 from iyuriysoft/patch-4.
---
griffin-doc/measure/dsl-guide.md | 14 ++++++++++++++
griffin-doc/measure/measure-configuration-guide.md | 20 ++++++++++++++++++++
2 files changed, 34 insertions(+)
diff --git a/griffin-doc/measure/dsl-guide.md b/griffin-doc/measure/dsl-guide.md
index 5296176..b9a4b8c 100644
--- a/griffin-doc/measure/dsl-guide.md
+++ b/griffin-doc/measure/dsl-guide.md
@@ -127,6 +127,14 @@ Profiling rule expression in Apache Griffin DSL is a sql-like expression, with s
Distinctness rule expression in Apache Griffin DSL is a list of selection expressions separated by comma, indicating the columns to check for distinctness.
e.g. `name, age`, `name, (age + 1) as next_age`
+### Uniqueness Rule
+Uniqueness rule expression in Apache Griffin DSL is a list of selection expressions separated by comma, indicating the columns to check for uniqueness. Uniqueness means the items contain no duplicated data.
+ e.g. `name, age`, `name, (age + 1) as next_age`
+
+### Completeness Rule
+Completeness rule expression in Apache Griffin DSL is a list of selection expressions separated by comma, indicating the columns to check for null values.
+ e.g. `name, age`, `name, (age + 1) as next_age`
+
### Timeliness Rule
Timeliness rule expression in Apache Griffin DSL is a list of selection expressions separated by comma, indicating the input time and output time (the calculation time is used as default if the output time is not set).
e.g. `ts`, `ts, end_ts`
@@ -167,6 +175,12 @@ For example, the dsl rule is `name, age`, which represents the distinct requests
After the translation, the metrics will be persisted in table `distinct_metric` and `dup_metric`.
+### Completeness
+Completeness checks for null values: the columns you measure are incomplete if they are null.
+- **total count of source**: `SELECT COUNT(*) AS total FROM source`, saved as table `total_count`.
+- **incomplete metric**: `SELECT COUNT(*) AS incomplete FROM source WHERE NOT (id IS NOT NULL)`, saved as table `incomplete_count`.
+- **complete metric**: `SELECT (source.total - incomplete_count.incomplete) AS complete FROM source LEFT JOIN incomplete_count`, saved as table `complete_count`.
+
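The three completeness steps above can be sketched in plain Python; this is an illustrative approximation of the SQL semantics, not Griffin's Spark code, and the sample rows and the nullable `id` column are hypothetical:

```python
# Illustrative sketch of the completeness translation (not Griffin's code):
# emulate the three SQL steps over hypothetical rows with a nullable "id".
rows = [{"id": 1}, {"id": 2}, {"id": None}, {"id": 4}]

# SELECT COUNT(*) AS total FROM source
total = len(rows)

# SELECT COUNT(*) AS incomplete FROM source WHERE NOT (id IS NOT NULL)
incomplete = sum(1 for r in rows if r["id"] is None)

# SELECT (total - incomplete) AS complete
complete = total - incomplete

print(total, incomplete, complete)  # 4 1 3
```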
### Timeliness
Timeliness measures the latency of each item and gets the statistics of the latencies.
For example, if the dsl rule is `ts, out_ts`, the first column is the input time of the item and the second column is the output time; if the output time is not set, `__tmst` will be the default output time column. After the translation, the sql rule is as below:
diff --git a/griffin-doc/measure/measure-configuration-guide.md b/griffin-doc/measure/measure-configuration-guide.md
index 2522ee4..1013ae6 100644
--- a/griffin-doc/measure/measure-configuration-guide.md
+++ b/griffin-doc/measure/measure-configuration-guide.md
@@ -238,10 +238,30 @@ Above lists DQ job configure parameters.
* num: the duplicate number name in metric, optional.
* duplication.array: optional, if set as a non-empty string, the duplication metric will be computed, and the group metric name is this string.
* with.accumulate: optional, default is true, if set as false, in streaming mode, the data set will not compare with old data to check distinctness.
+  + uniqueness dq type detail configuration
+    * source: name of data source to measure uniqueness.
+    * target: name of data source to compare with. It is always the same as source, or a superset of source.
+    * unique: the unique count name in metric, optional.
+    * total: the total count name in metric, optional.
+    * dup: the duplicate count name in metric, optional.
+    * num: the duplicate number name in metric, optional.
+    * duplication.array: optional, if set as a non-empty string, the duplication metric will be computed, and the group metric name is this string.
+  + completeness dq type detail configuration
+    * source: name of data source to measure completeness.
+    * total: the total count name in metric, optional.
+    * complete: the column name in metric, optional. The number of not-null values.
+    * incomplete: the column name in metric, optional. The number of null values.
+ timeliness dq type detail configuration
* source: name of data source to measure timeliness.
* latency: the latency column name in metric, optional.
+    * total: column name, optional.
+    * avg: column name, optional. The average latency.
+    * step: column name, optional. The histogram bin, computed as step = floor(latency / step.size).
+    * count: column name, optional. The number of latencies falling into each step.
+    * percentile: column name, optional.
* threshold: optional, if set as a time string like "1h", the items with latency more than 1 hour will be recorded.
+    * step.size: optional, used to build the histogram of latencies, in milliseconds (e.g. "100").
+    * percentile.values: optional, used to compute the percentile metrics, with values between 0 and 1. For instance, if set to [0.1, 0.9], we can see the fastest and slowest latencies.
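The timeliness metrics above (total, avg, step histogram, percentiles) can be sketched in plain Python; this is an illustrative approximation, not Griffin's implementation, and the sample latencies, step size, and nearest-rank percentile choice are assumptions:

```python
import math

# Illustrative sketch of the timeliness metrics (not Griffin's code).
latencies = [120, 250, 90, 400, 310]  # hypothetical latencies in milliseconds

total = len(latencies)                # "total" metric
avg = sum(latencies) / total          # "avg" metric

# Histogram: each latency falls into bin step = floor(latency / step.size).
step_size = 100                       # corresponds to step.size = "100"
steps = {}
for latency in latencies:
    step = math.floor(latency / step_size)
    steps[step] = steps.get(step, 0) + 1  # "count" of latencies in each step

# percentile.values, e.g. [0.1, 0.9]: a simple nearest-rank percentile.
def percentile(values, p):
    ordered = sorted(values)
    idx = min(int(p * len(ordered)), len(ordered) - 1)
    return ordered[idx]

print(total, avg)                                              # 5 234.0
print(percentile(latencies, 0.1), percentile(latencies, 0.9))  # 90 400
```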
- **cache**: Cache output dataframe. Optional, valid only for "spark-sql" and "df-ops" mode. Defaults to `false` if not specified.
- **out**: List of output sinks for the job.
+ Metric output.