Repository: incubator-griffin-site
Updated Branches:
  refs/heads/master 3891ad018 -> 077729686
profiling

Project: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/commit/ce45b1dd
Tree: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/tree/ce45b1dd
Diff: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/diff/ce45b1dd

Branch: refs/heads/master
Commit: ce45b1dd3fbf6f9cf18c5995ab5d6cfcaa37827e
Parents: 78070b8
Author: Lionel Liu <bhlx3l...@163.com>
Authored: Tue Sep 18 15:33:57 2018 +0800
Committer: Lionel Liu <bhlx3l...@163.com>
Committed: Tue Sep 18 15:33:57 2018 +0800

----------------------------------------------------------------------
 profiling.md | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 165 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/blob/ce45b1dd/profiling.md
----------------------------------------------------------------------
diff --git a/profiling.md b/profiling.md
index e7a473e..8422a17 100644
--- a/profiling.md
+++ b/profiling.md
@@ -3,3 +3,168 @@ layout: doc
title: "Profiling Use Case"
permalink: /docs/profiling.html
---

## User Story
Say we have a data set (demo_src), partitioned by hour, and we want to know what the data looks like for each hour.

For simplicity, suppose the data set has the following schema:
```
id bigint
age int
desc string
dt string
hour string
```
Both dt and hour are partition columns: every day has one daily partition dt (like 20180912), and within each day there are 24 hourly partitions (like 00, 01, 02, ..., 23).

## Environment Preparation
You need to prepare the environment for the Apache Griffin measure module, including the following software:
- JDK (1.8+)
- Hadoop (2.6.0+)
- Spark (2.2.1+)
- Hive (2.2.0)

## Build Griffin Measure Module
1. Download the Griffin source package [here](https://www.apache.org/dist/incubator/griffin/0.3.0-incubating).
2. Unzip the source package.
   ```
   unzip griffin-0.3.0-incubating-source-release.zip
   cd griffin-0.3.0-incubating-source-release
   ```
3. Build the Griffin jars.
   ```
   mvn clean install
   ```

   Move the built Griffin measure jar to your work path.

   ```
   mv measure/target/measure-0.3.0-incubating.jar <work path>/griffin-measure.jar
   ```

## Data Preparation

For this quick start, we will generate a Hive table demo_src.
```
--create hive table here. hql script
--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_src`(
  `id` bigint,
  `age` int,
  `desc` string)
PARTITIONED BY (
  `dt` string,
  `hour` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
LOCATION
  'hdfs:///griffin/data/batch/demo_src';
```
The data could be generated like this:
```
1|18|student
2|23|engineer
3|42|cook
...
```
You can download the [demo data](/data/batch) and execute `./gen_demo_data.sh` to get the data source file.
Then we load the data into the Hive table for each hour.
```
LOAD DATA LOCAL INPATH 'demo_src' INTO TABLE demo_src PARTITION (dt='20180912',hour='09');
```
Or you can just execute `./gen-hive-data.sh` in the downloaded directory above, to generate and load data into the table hourly.
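Before defining the measure, it can help to confirm that the hourly partitions were actually created and populated. A minimal sanity check from the shell (the partition values `20180912`/`09` are just the example used above):
```
# list the partitions registered for demo_src
hive -e "SHOW PARTITIONS demo_src;"

# peek at a few rows of one hourly partition
hive -e "SELECT * FROM demo_src WHERE dt='20180912' AND hour='09' LIMIT 5;"
```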
## Define data quality measure

#### Griffin env configuration
The environment config file: env.json
```
{
  "spark": {
    "log.level": "WARN"
  },
  "sinks": [
    {
      "type": "console"
    },
    {
      "type": "hdfs",
      "config": {
        "path": "hdfs:///griffin/persist"
      }
    },
    {
      "type": "elasticsearch",
      "config": {
        "method": "post",
        "api": "http://es:9200/griffin/accuracy"
      }
    }
  ]
}
```

#### Define griffin data quality
The DQ config file: dq.json

```
{
  "name": "batch_prof",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src",
      "baseline": true,
      "connectors": [
        {
          "type": "hive",
          "version": "1.2",
          "config": {
            "database": "default",
            "table.name": "demo_src"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "profiling",
        "out.dataframe.name": "prof",
        "rule": "src.id.count() AS id_count, src.age.max() AS age_max, src.desc.length().max() AS desc_length_max",
        "out": [
          {
            "type": "metric",
            "name": "prof"
          }
        ]
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS"]
}
```

## Measure data quality
Submit the measure job to Spark, with the config file paths as parameters.

```
spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
--driver-memory 1g --executor-memory 1g --num-executors 2 \
<path>/griffin-measure.jar \
<path>/env.json <path>/dq.json
```

## Report data quality metrics
You can watch the calculation log in the console; when the job finishes, the result metrics are printed. The metrics will also be saved in HDFS: `hdfs:///griffin/persist/<job name>/<timestamp>/_METRICS` (see the sketch at the end of this page for reading them back).

## Refine data quality report
Depending on your business, you might need to refine your data quality measure further until you are satisfied.

## More Details
For more details about Griffin measures, you can visit our documentation on [github](https://github.com/apache/incubator-griffin/tree/master/griffin-doc).
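As referenced above, here is a minimal sketch for reading the persisted metrics back from HDFS once a run completes. `batch_prof` is the job name from dq.json, and each run writes into its own timestamp directory:
```
# list the runs persisted for this job
hdfs dfs -ls hdfs:///griffin/persist/batch_prof/

# print the metrics of one run (substitute an actual timestamp directory)
hdfs dfs -cat hdfs:///griffin/persist/batch_prof/<timestamp>/_METRICS
```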