[griffin] branch master updated: Fix deployment doc

guoyp Thu, 14 Mar 2019 07:32:10 -0700

This is an automated email from the ASF dual-hosted git repository.

guoyp pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/griffin.git



The following commit(s) were added to refs/heads/master by this push:
     new a021b61  Fix deployment doc
a021b61 is described below

commit a021b61aea8f88194ff2687a3b24f8ad26124df3
Author: Eugene <[email protected]>
AuthorDate: Thu Mar 14 22:31:36 2019 +0800

    Fix deployment doc
    
    GRIFFIN-239
    
    Complement and refresh deployment steps for Griffin.
    
    Author: Eugene <[email protected]>
    
    Closes #488 from toyboxman/deploy-doc.
---
 griffin-doc/deploy/deploy-guide.md | 695 ++++++++++++++++++++++++++++---------
 1 file changed, 532 insertions(+), 163 deletions(-)
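
Reader's note on the gen-hive-data.sh script added further down in this patch: it repeats the same substring arithmetic three times to turn a `date +%Y%m%d%H` stamp into a partition spec and a `_DONE` marker path. A minimal sketch of that step is shown here, assuming bash. The function names `partition_spec` and `done_path` are hypothetical and are not part of the commit.

```bash
#!/bin/bash
# Hypothetical helpers, not part of the commit: they mirror the substring
# logic that gen-hive-data.sh applies to a YYYYmmddHH stamp.

# Turn a YYYYmmddHH stamp into the partition spec substituted for
# PARTITION_DATE in insert-data.hql.template.
partition_spec() {
  local stamp="$1"
  echo "dt='${stamp:0:8}',hour='${stamp:8:2}'"
}

# Build the HDFS path of the _DONE marker for a table and a stamp.
done_path() {
  local table="$1" stamp="$2"
  echo "/griffin/data/batch/${table}/dt=${stamp:0:8}/hour=${stamp:8:2}/_DONE"
}

# Usage: print the values for the current hour.
stamp="$(date +%Y%m%d%H)"
partition_spec "$stamp"
done_path demo_src "$stamp"
```

Keeping the arithmetic in one place would make the three near-identical blocks of the script (current hour, last hour, next hour) differ only in the stamp they pass in.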

diff --git a/griffin-doc/deploy/deploy-guide.md b/griffin-doc/deploy/deploy-guide.md
index 7327b15..cd15213 100644
--- a/griffin-doc/deploy/deploy-guide.md
+++ b/griffin-doc/deploy/deploy-guide.md
@@ -18,15 +18,18 @@ under the License.
 -->
 
 # Apache Griffin Deployment Guide
-For Apache Griffin users, please follow the instructions below to deploy Apache Griffin in your environment. Note that there are some dependencies that should be installed firstly.
+If you are a new guy for Apache Griffin, please follow the instructions below to deploy Apache Griffin in your environment. Note that those steps will install all products in one physical machine, so you have to tune configurations depending on true topology.
 
 ### Prerequisites
 Firstly you need to install and configure following software products, here we use [ubuntu-18.10](https://www.ubuntu.com/download) as sample OS to prepare all dependencies.
 ```bash
 # put all download packages into /apache folder
-$ mkdir /home/user/software
-$ sudo ln -s /home/user/software /apache
+$ mkdir /home/<user>/software
+$ mkdir /home/<user>/software/data
+$ sudo ln -s /home/<user>/software /apache
 $ sudo ln -s /apache/data /data
+$ mkdir /apache/tmp
+$ mkdir /apache/tmp/hive
 ```
 
 - JDK (1.8 or later versions)
@@ -56,17 +59,19 @@ $ node -v
 $ npm -v
 ```
 
-- [Hadoop](http://apache.claz.org/hadoop/common/) (2.6.0 or later), you can get some help [here](https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html).
+- [Hadoop](http://apache.claz.org/hadoop/common/) (2.6.0 or later), you can get some helps [here](https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html).
 
-- [Hive](http://apache.claz.org/hive/) (version 2.x), you can get some help [here](https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-RunningHive).
+- [Hive](http://apache.claz.org/hive/) (version 2.x), you can get some helps [here](https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-RunningHive).
 
-- [Spark](http://spark.apache.org/downloads.html) (version 2.2.1), if you want to install Pseudo Distributed/Single Node Cluster, you can get some help [here](http://why-not-learn-something.blogspot.com/2015/06/spark-installation-pseudo.html).
+- [Spark](http://spark.apache.org/downloads.html) (version 2.2.1), if you want to install Pseudo Distributed/Single Node Cluster, you can get some helps [here](http://why-not-learn-something.blogspot.com/2015/06/spark-installation-pseudo.html).
 
-- [Livy](http://archive.cloudera.com/beta/livy/livy-server-0.3.0.zip), you can get some help [here](http://livy.io/quickstart.html).
+- [Livy](http://archive.cloudera.com/beta/livy/livy-server-0.3.0.zip), you can get some helps [here](http://livy.io/quickstart.html).
 
 - [ElasticSearch](https://www.elastic.co/downloads/elasticsearch) (5.0 or later versions).
        ElasticSearch works as a metrics collector, Apache Griffin produces metrics into it, and our default UI gets metrics from it, you can use them by your own way as well.
 
+- [Scala](https://downloads.lightbend.com/scala/2.12.8/scala-2.12.8.tgz), you can get some helps [here](https://www.scala-lang.org/).
+
 ### Configuration
 
 #### PostgreSQL
@@ -93,7 +98,7 @@ mysql -u <username> -p quartz < Init_quartz_mysql_innodb.sql
 
 #### Set Env
 
-export those variables below, or create hadoop_env.sh and put it into .bashrc
+Export those variables below, or create griffin_env.sh and put it into .bashrc.
 ```bash
 #!/bin/bash
 export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
@@ -111,13 +116,13 @@ export LIVY_HOME=/apache/livy
 export HIVE_HOME=/apache/hive
 export YARN_HOME=/apache/hadoop
 export SCALA_HOME=/apache/scala
+
+export PATH=$PATH:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$LIVY_HOME/bin:$SCALA_HOME/bin
 ```
 
 #### Hadoop
-
 * **update configuration**
 
-here are sample configurations for hadoop<br>
 Put site-specific property overrides in this file **/apache/hadoop/etc/hadoop/core-site.xml**
 ```xml
 <configuration>
@@ -138,14 +143,6 @@ Put site-specific property overrides in this file **/apache/hadoop/etc/hadoop/hd
         <value>1</value>
     </property>
     <property>
-        <name>dfs.namenode.servicerpc-address</name>
-        <value>127.0.0.1:9001</value>
-    </property>
-    <property>
-        <name>dfs.namenode.rpc-address</name>
-        <value>127.0.0.1:9002</value>
-    </property>
-    <property>
         <name>dfs.namenode.name.dir</name>
         <value>file:///data/hadoop-data/nn</value>
     </property>
@@ -175,12 +172,20 @@ Put site-specific property overrides in this file **/apache/hadoop/etc/hadoop/hd
 * **start/stop hadoop nodes**
 ```bash
 # format name node
+# NOTE: if you already have executed namenode-format before, it'll change cluster ID in 
+# name node's VERSION file after you run it again. so you need to guarantee same cluster ID
+# in data node's VERSION file, otherwise data node will fail to start up.
+# VERSION file resides in /apache/data/hadoop-data/nn, snn, dn denoted in previous config. 
 /apache/hadoop/bin/hdfs namenode -format
-# start namenode/datanode
+# start namenode/secondarynamenode/datanode
+# NOTE: you should use 'ps -ef|grep java' to check if namenode/secondary namenode/datanode
+# are available after starting dfs service.
+# if there is any error, please find clues from /apache/hadoop/logs/
 /apache/hadoop/sbin/start-dfs.sh
 # stop all nodes
-/apache/hadoop/sbin/stop-all.sh
+/apache/hadoop/sbin/stop-dfs.sh
 ```
+Here you can access http://127.0.0.1:50070/ to check name node.
 * **start/stop hadoop ResourceManager**
 ```bash
 # manually clear the ResourceManager state store
@@ -190,6 +195,9 @@ Put site-specific property overrides in this file **/apache/hadoop/etc/hadoop/hd
 # stop the ResourceManager
 /apache/hadoop/sbin/yarn-daemon.sh stop resourcemanager
 ```
+Here you can access http://127.0.0.1:8088/cluster to check hadoop cluster.
+ 
+Hadoop daemons also expose some information over HTTP like http://127.0.0.1:8088/stacks. Please refer to [blog](https://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/)
 * **start/stop hadoop NodeManager**
 ```bash
 # startup the NodeManager
@@ -197,7 +205,8 @@ Put site-specific property overrides in this file **/apache/hadoop/etc/hadoop/hd
 # stop the NodeManager
 /apache/hadoop/sbin/yarn-daemon.sh stop nodemanager
 ```
-* **start/stop hadoop HistoryServer**
+Here you can access http://127.0.0.1:8088/cluster/nodes to check hadoop nodes, you should see one node in the list.
+* **(optional) start/stop hadoop HistoryServer**
 ```bash
 # startup the HistoryServer
 /apache/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
@@ -206,18 +215,32 @@ Put site-specific property overrides in this file **/apache/hadoop/etc/hadoop/hd
 ```
 
 #### Hive
-You need to make sure that your spark cluster could access your HiveContext.
 * **update configuration**
 Copy hive/conf/hive-site.xml.template to hive/conf/hive-site.xml and update some fields.
 ```xml
 +++ hive/conf/hive-site.xml    2018-12-16 11:17:51.000000000 +0800
+@@ -72,12 +72,12 @@
+   </property>
+   <property>
+     <name>hive.exec.local.scratchdir</name>
+-    <value>${system:java.io.tmpdir}/${system:user.name}</value>
++    <value>/apache/tmp/hive</value>
+     <description>Local scratch space for Hive jobs</description>
+   </property>
+   <property>
+     <name>hive.downloaded.resources.dir</name>
+-    <value>${system:java.io.tmpdir}/${hive.session.id}_resources</value>
++    <value>/apache/tmp/hive/${hive.session.id}_resources</value>
+     <description>Temporary local directory for added resources in the remote file system.</description>
+   </property>
+   <property>
 @@ -368,7 +368,7 @@
    </property>
    <property>
      <name>hive.metastore.uris</name>
 -    <value/>
 +    <value>thrift://127.0.0.1:9083</value>
-     <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
+     <description>Thrift URI for the remote metastore.</description>
    </property>
    <property>
 @@ -527,7 +527,7 @@
@@ -256,6 +279,23 @@ Copy hive/conf/hive-site.xml.template to hive/conf/hive-site.xml and update some
      <description>Username to use against metastore database</description>
    </property>
    <property>
+@@ -1682,7 +1682,7 @@
+   </property>
+   <property>
+     <name>hive.querylog.location</name>
+-    <value>${system:java.io.tmpdir}/${system:user.name}</value>
++    <value>/apache/tmp/hive</value>
+     <description>Location of Hive run time structured log file</description>
+   </property>
+   <property>
+@@ -3973,7 +3973,7 @@
+   </property>
+   <property>
+     <name>hive.server2.logging.operation.log.location</name>
+-    <value>${system:java.io.tmpdir}/${system:user.name}/operation_logs</value>
++    <value>/apache/tmp/hive/operation_logs</value>
+   </property>
+   <property>
 ```
 
 * **start up hive metastore service**
@@ -265,159 +305,270 @@ Copy hive/conf/hive-site.xml.template to hive/conf/hive-site.xml and update some
 ```
 
 #### Spark
-* **start up spark nodes**
+* **update configuration**
+
+Check $SPARK_HOME/conf/spark-default.conf
+```
+spark.master                    yarn-cluster
+spark.serializer                org.apache.spark.serializer.KryoSerializer
+spark.yarn.jars                 hdfs:///home/spark_lib/*
+spark.yarn.dist.files          hdfs:///home/spark_conf/hive-site.xml
+spark.sql.broadcastTimeout  500
+```
+Check $SPARK_HOME/conf/spark-env.sh
+```
+HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
+SPARK_MASTER_HOST=localhost
+SPARK_MASTER_PORT=7077
+SPARK_MASTER_WEBUI_PORT=8082
+SPARK_LOCAL_IP=localhost
+SPARK_PID_DIR=/apache/pids
+```
+Upload some files otherwise you will hit `Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster`, when you schedule spark applications.
+```bash
+hdfs dfs -mkdir /home/spark_lib
+hdfs dfs -mkdir /home/spark_conf
+hdfs dfs -put $SPARK_HOME/jars/*  hdfs:///home/spark_lib/
+hdfs dfs -put $HIVE_HOME/conf/hive-site.xml hdfs:///home/spark_conf/
+```
+* **start/stop spark nodes**
 ```bash
 cp /apache/hive/conf/hive-site.xml /apache/spark/conf/
+# start master and slave nodes
 /apache/spark/sbin/start-master.sh
 /apache/spark/sbin/start-slave.sh  spark://localhost:7077
+
+# stop master and slave nodes
+/apache/spark/sbin/stop-slaves.sh 
+/apache/spark/sbin/stop-master.sh 
+
+# stop all
+/apache/spark/sbin/stop-all.sh
 ```
 
 #### Livy
 Apache Griffin need to schedule spark jobs by server, we use livy to submit our jobs.
-For some issues of Livy for HiveContext, we need to download 3 files or get them from Spark lib `$SPARK_HOME/lib/`, and put them into HDFS.
-```
-datanucleus-api-jdo-3.2.6.jar
-datanucleus-core-3.2.10.jar
-datanucleus-rdbms-3.2.9.jar
-```
+
 * **update configuration**
 ```bash
-mkdir livy/logs
-
-# update livy/conf/livy.conf
+mkdir /apache/livy/logs
+```
+Update $LIVY_HOME/conf/livy.conf
+```bash
+# update /apache/livy/conf/livy.conf
 livy.server.host = 127.0.0.1
 livy.spark.master = yarn
 livy.spark.deployMode = cluster
 livy.repl.enableHiveContext = true
+livy.server.port 8998
 ```
 * **start up livy**
 ```bash
-/apache/livy/LivyServer
+/apache/livy/bin/livy-server start
 ```
 
 #### Elasticsearch
+* **update configuration**
 
-You might want to create Elasticsearch index in advance, in order to set number of shards, replicas, and other settings to desired values:
+Update $ES_HOME/config/elasticsearch.yml
+```
+network.host: 127.0.0.1
+http.cors.enabled: true
+http.cors.allow-origin: "*"
+```
+* **start up elasticsearch**
+```bash
+/apache/elastic/bin/elasticsearch
+```
+You can access http://127.0.0.1:9200/ to check elasticsearch service.
+
+#### Griffin
+You can download latest package from [official link](http://griffin.apache.org/docs/latest.html), or locally build on [source codes](https://github.com/apache/griffin.git).
+
+Before building Griffin, you have to update those configuration depending on previous steps's configuration.
+
+* **service/src/main/resources/application.properties**
+
+You can get more detailed configuration description in [here](#griffin-customization).
+```
+# Apache Griffin server port (default 8080)
+server.port = 8080
+spring.application.name=griffin_service
+
+# db configuration
+spring.datasource.url=jdbc:postgresql://localhost:5432/myDB?autoReconnect=true&useSSL=false
+spring.datasource.username=king
+spring.datasource.password=secret
+spring.jpa.generate-ddl=true
+spring.datasource.driver-class-name=org.postgresql.Driver
+spring.jpa.show-sql=true
+
+# Hive metastore
+hive.metastore.uris=thrift://localhost:9083
+hive.metastore.dbname=default
+hive.hmshandler.retry.attempts=15
+hive.hmshandler.retry.interval=2000ms
+# Hive cache time
+cache.evict.hive.fixedRate.in.milliseconds=900000
+
+# Kafka schema registry
+kafka.schema.registry.url=http://localhost:8081
+# Update job instance state at regular intervals
+jobInstance.fixedDelay.in.milliseconds=60000
+# Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
+jobInstance.expired.milliseconds=604800000
+# schedule predicate job every 5 minutes and repeat 12 times at most
+#interval time unit s:second m:minute h:hour d:day,only support these four units
+predicate.job.interval=5m
+predicate.job.repeat.count=12
+# external properties directory location
+external.config.location=
+# external BATCH or STREAMING env
+external.env.location=
+# login strategy ("default" or "ldap")
+login.strategy=default
+# ldap
+ldap.url=ldap://hostname:port
+ldap.email=@example.com
+ldap.searchBase=DC=org,DC=example
+ldap.searchPattern=(sAMAccountName={0})
+# hdfs default name
+fs.defaultFS=
+
+# elasticsearch
+# elasticsearch.host = <IP>
+# elasticsearch.port = <elasticsearch rest port>
+# elasticsearch.user = user
+# elasticsearch.password = password
+elasticsearch.host=localhost
+elasticsearch.port=9200
+elasticsearch.scheme=http
+
+# livy
+livy.uri=http://localhost:8998/batches
+# yarn url
+yarn.uri=http://localhost:8088
+# griffin event listener
+internal.event.listeners=GriffinJobEventHook
+```  
+
+* **service/src/main/resources/quartz.properties**
+```
+org.quartz.scheduler.instanceName=spring-boot-quartz
+org.quartz.scheduler.instanceId=AUTO
+org.quartz.threadPool.threadCount=5
+org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
+# If you use postgresql, set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
+# If you use mysql, set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate
+# If you use h2, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others
+org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
+org.quartz.jobStore.useProperties=true
+org.quartz.jobStore.misfireThreshold=60000
+org.quartz.jobStore.tablePrefix=QRTZ_
+org.quartz.jobStore.isClustered=true
+org.quartz.jobStore.clusterCheckinInterval=20000
+```
+
+* **service/src/main/resources/sparkProperties.json**
+
+**griffin measure path** is the location where you should put the jar file of measure module.
 ```
-curl -XPUT http://es:9200/griffin -d '
 {
-    "aliases": {},
-    "mappings": {
-        "accuracy": {
-            "properties": {
-                "name": {
-                    "fields": {
-                        "keyword": {
-                            "ignore_above": 256,
-                            "type": "keyword"
-                        }
-                    },
-                    "type": "text"
-                },
-                "tmst": {
-                    "type": "date"
-                }
-            }
-        }
+    "file": "hdfs:///<griffin measure path>/griffin-measure.jar",
+    "className": "org.apache.griffin.measure.Application",
+    "name": "griffin",
+    "queue": "default",
+    "numExecutors": 3,
+    "executorCores": 1,
+    "driverMemory": "1g",
+    "executorMemory": "1g",
+    "conf": {
+        "spark.yarn.dist.files": "hdfs:///<path to>/hive-site.xml"
     },
-    "settings": {
-        "index": {
-            "number_of_replicas": "2",
-            "number_of_shards": "5"
-        }
+    "files": [
+    ],
+    "jars": [
+    ]
+}
+```
+
+* **service/src/main/resources/env/env_batch.json**
+
+Adjust sinks according to your requirement. At least, you will need to adjust HDFS output
+directory (hdfs:///griffin/persist by default), and Elasticsearch URL (http://es:9200/griffin/accuracy by default).
+Similar changes are required in `env_streaming.json`.
+```
+{
+  "spark": {
+    "log.level": "WARN"
+  },
+  "sinks": [
+    {
+      "type": "CONSOLE",
+      "config": {
+        "max.log.lines": 10
+      }
+    },
+    {
+      "type": "HDFS",
+      "config": {
+        "path": "hdfs:///griffin/persist",
+        "max.persist.lines": 10000,
+        "max.lines.per.file": 10000
+      }
+    },
+    {
+      "type": "ELASTICSEARCH",
+      "config": {
+        "method": "post",
+        "api": "http://127.0.0.1:9200/griffin/accuracy",
+        "connection.timeout": "1m",
+        "retry": 10
+      }
     }
+  ],
+  "griffin.checkpoint": []
 }
-'
-```
-You should also modify some configurations of Apache Griffin for your environment.
-
-- <b>service/src/main/resources/application.properties</b>
-
-    ```
-    # Apache Griffin server port (default 8080)
-    server.port = 8080
-    # jpa
-    spring.datasource.url = jdbc:postgresql://<your IP>:5432/quartz?autoReconnect=true&useSSL=false
-    spring.datasource.username = <user name>
-    spring.datasource.password = <password>
-    spring.jpa.generate-ddl=true
-    spring.datasource.driverClassName = org.postgresql.Driver
-    spring.jpa.show-sql = true
-
-    # hive metastore
-    hive.metastore.uris = thrift://<your IP>:9083
-    hive.metastore.dbname = <hive database name>    # default is "default"
-
-    # external properties directory location, ignore it if not required
-    external.config.location =
-
-       # login strategy, default is "default"
-       login.strategy = <default or ldap>
-
-       # ldap properties, ignore them if ldap is not enabled
-       ldap.url = ldap://hostname:port
-       ldap.email = @example.com
-       ldap.searchBase = DC=org,DC=example
-       ldap.searchPattern = (sAMAccountName={0})
-
-       # hdfs, ignore it if you do not need predicate job
-       fs.defaultFS = hdfs://<hdfs-default-name>
-
-       # elasticsearch
-       elasticsearch.host = <your IP>
-       elasticsearch.port = <your elasticsearch rest port>
-       # authentication properties, uncomment if basic authentication is enabled
-       # elasticsearch.user = user
-       # elasticsearch.password = password
-       # livy
-       # Port Livy: 8998 Livy2:8999
-       livy.uri=http://localhost:8999/batches
-
-       # yarn url
-       yarn.uri=http://localhost:8088
-
-       
-    ```
-
-- <b>service/src/main/resources/sparkProperties.json</b>
-    ```
-       {
-         "file": "hdfs:///<griffin measure path>/griffin-measure.jar",
-         "className": "org.apache.griffin.measure.Application",
-         "name": "griffin",
-         "queue": "default",
-         "numExecutors": 3,
-         "executorCores": 1,
-         "driverMemory": "1g",
-         "executorMemory": "1g",
-         "conf": {
-               "spark.yarn.dist.files": "hdfs:///<path to>/hive-site.xml"
-        },
-         "files": [
-         ],
-         "jars": [
-         ]
-       }
-
-    ```
-    - \<griffin measure path> is the location where you should put the jar file of measure module.
-
-- <b>service/src/main/resources/env/env_batch.json</b>
-
-    Adjust sinks according to your requirement. At least, you will need to adjust HDFS output
-    directory (hdfs:///griffin/persist by default), and Elasticsearch URL (http://es:9200/griffin/accuracy by default).
-    Similar changes are required in `env_streaming.json`.
-
-#### Compression
+```
+
+It's easy to build Griffin, just run maven command `mvn clean install`. Successfully building, you can get two jars `service-0.4.0.jar`,`measure-0.4.0.jar` from target folder in service and measure module.
+
+Upload measure's jar to hadoop folder.
+```
+# change jar name
+mv measure-0.4.0.jar griffin-measure.jar
+mv service-0.4.0.jar griffin-service.jar
+# upload measure jar file
+hdfs dfs -put griffin-measure.jar /griffin/
+```
+
+Startup service.jar，run Griffin management service.
+```
+cd $GRIFFIN_HOME
+nohup java -jar griffin-service.jar>service.out 2>&1 &
+```
+
+After a few seconds, we can visit our default UI of Apache Griffin (by default the port of spring boot is 8080).
+```
+http://<your IP>:8080
+```
+
+You can conduct UI operations following the steps [here](../ui/user-guide.md).
+
+**Note**: The UI does not support all the backend features, to experience the advanced features you can use service's [api](../service/api-guide.md) directly.
+
+##### Griffin Customization
+- Compression
 
 Griffin Service is regular Spring Boot application, so it supports all customizations from Spring Boot.
 To enable output compression, the following should be added to `application.properties`:
 ```
 server.compression.enabled=true
-server.compression.mime-types=application/json,application/xml,text/html,text/xml,text/plain,application/javascript,text/css
+server.compression.mime-types=application/json,application/xml,text/html,\
+                              text/xml,text/plain,application/javascript,text/css
 ```
 
-#### SSL
+- SSL
 
 It is possible to enable SSL encryption for api and web endpoints. To do that, you will need to prepare keystore in Spring-compatible format (for example, PKCS12), and add the following values to `application.properties`:
 ```
@@ -427,7 +578,7 @@ server.ssl.keyStoreType=PKCS12
 server.ssl.keyAlias=your_key_alias
 ```
 
-#### LDAP
+- LDAP
 
 The following properties are available for LDAP:
  - **ldap.url**: URL of LDAP server.
@@ -438,33 +589,251 @@ The following properties are available for LDAP:
  - **ldap.bindDN**: Optional DN of service account used for user lookup. Useful if user's DN is different than attribute used as user's login, or if users' DNs are ambiguous.
  - **ldap.bindPassword**: Optional password of bind service account.
 
-### Build and Run
+#### Launch Griffin Demo
 
-Build the whole project and deploy. (NPM should be installed)
+* **create hadoop folder**
+```bash
+$ hdfs dfs -ls /
+Found 3 items
+drwxr-xr-x   - king supergroup          0 2019-02-21 17:25 /data
+drwx-wx-wx   - king supergroup          0 2019-02-21 16:45 /tmp
+drwxr-xr-x   - king supergroup          0 2019-02-26 08:48 /user
+
+$ hdfs dfs -mkdir /griffin
 
-  ```
-  mvn clean install
-  ```
+$ hdfs dfs -ls /
+Found 4 items
+drwxr-xr-x   - king supergroup          0 2019-02-21 17:25 /data
+drwxr-xr-x   - king supergroup          0 2019-02-26 10:30 /griffin
+drwx-wx-wx   - king supergroup          0 2019-02-21 16:45 /tmp
+drwxr-xr-x   - king supergroup          0 2019-02-26 08:48 /user
 
-Put jar file of measure module into \<griffin measure path> in HDFS
+$ hdfs dfs -put griffin-measure.jar /griffin/
 
+$ hdfs dfs -ls /griffin
+-rw-r--r--   1 king supergroup   30927307 2019-02-26 10:36 /griffin/griffin-measure.jar
 ```
-cp measure/target/measure-<version>-incubating-SNAPSHOT.jar measure/target/griffin-measure.jar
-hdfs dfs -put measure/target/griffin-measure.jar <griffin measure path>/
-  ```
+Here you can refer to [dfs commands](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#dfs),
 
+get command [examples](http://fibrevillage.com/storage/630-using-hdfs-command-line-to-manage-files-and-directories-on-hadoop).
 
 
-After all environment services startup, we can start our server.
+* **integrate hadoop and hive service**
+```bash
+# create /home/spark_conf
+# -p option behavior is much like Unix mkdir -p, creating parent directories along the path.
+hdfs dfs -mkdir -p /home/spark_conf
 
-  ```
-  java -jar service/target/service.jar
-  ```
+# upload hive-site.xml
+hdfs dfs -put hive-site.xml /home/spark_conf/
+```
 
-After a few seconds, we can visit our default UI of Apache Griffin (by default the port of spring boot is 8080).
+* **prepare demo tables**
+```bash
+# login hive client
+/apache/hive/bin/hive --database default
+
+# create demo tables
+hive> CREATE EXTERNAL TABLE `demo_src`(
+  `id` bigint,
+  `age` int,
+  `desc` string) 
+PARTITIONED BY (
+  `dt` string,
+  `hour` string)
+ROW FORMAT DELIMITED
+  FIELDS TERMINATED BY '|'
+LOCATION
+  'hdfs://127.0.0.1:9000/griffin/data/batch/demo_src';
+  
+hive> CREATE EXTERNAL TABLE `demo_tgt`(
+  `id` bigint,
+  `age` int,
+  `desc` string) 
+PARTITIONED BY (
+  `dt` string,
+  `hour` string)
+ROW FORMAT DELIMITED
+  FIELDS TERMINATED BY '|'
+LOCATION
+  'hdfs://127.0.0.1:9000/griffin/data/batch/demo_tgt';
+
+# check tables created  
+hive> show tables;
+OK
+demo_src
+demo_tgt
+Time taken: 0.04 seconds, Fetched: 2 row(s)
+```
+
+Check table definition.
+```bash
+hive> show create table demo_src;
+OK
+CREATE EXTERNAL TABLE `demo_src`(
+  `id` bigint, 
+  `age` int, 
+  `desc` string)
+PARTITIONED BY ( 
+  `dt` string, 
+  `hour` string)
+ROW FORMAT SERDE 
+  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
+WITH SERDEPROPERTIES ( 
+  'field.delim'='|', 
+  'serialization.format'='|') 
+STORED AS INPUTFORMAT 
+  'org.apache.hadoop.mapred.TextInputFormat' 
+OUTPUTFORMAT 
+  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
+LOCATION
+  'hdfs://127.0.0.1:9000/griffin/data/batch/demo_src'
+TBLPROPERTIES (
+  'transient_lastDdlTime'='1551168613')
+Time taken: 3.762 seconds, Fetched: 20 row(s)
+```
+
+If the table definition is not correct, drop it.
+```bash
+hive> drop table if exists demo_src;
+OK
+Time taken: 3.764 seconds
+hive> drop table if exists demo_tgt;
+OK
+Time taken: 0.632 seconds
+```
 
-  ```
-  http://<your IP>:8080
-  ```
+* **spawn demo data**
+There has been a script spawning test data, you can fetch it from [batch data](http://griffin.apache.org/data/batch/).
+And then execute ./gen_demo_data.sh to get the two data source files.
+```bash
+/apache/data/demo$ wget http://griffin.apache.org/data/batch/gen_demo_data.sh
+/apache/data/demo$ wget http://griffin.apache.org/data/batch/gen_delta_src.sh
+/apache/data/demo$ wget http://griffin.apache.org/data/batch/demo_basic
+/apache/data/demo$ wget http://griffin.apache.org/data/batch/delta_tgt
+/apache/data/demo$ wget http://griffin.apache.org/data/batch/insert-data.hql.template
+/apache/data/demo$ chmod 755 *.sh
+/apache/data/demo$ ./gen_demo_data.sh
+```
+
+Create gen-hive-data.sh
+```
+#!/bin/bash
+
+#create table
+hive -f create-table.hql
+echo "create table done"
+
+#current hour
+sudo ./gen_demo_data.sh
+cur_date=`date +%Y%m%d%H`
+dt=${cur_date:0:8}
+hour=${cur_date:8:2}
+partition_date="dt='$dt',hour='$hour'"
+sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
+hive -f insert-data.hql
+src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
+tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
+hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
+hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
+hadoop fs -touchz ${src_done_path}
+hadoop fs -touchz ${tgt_done_path}
+echo "insert data [$partition_date] done"
+
+#last hour
+sudo ./gen_demo_data.sh
+cur_date=`date -d '1 hour ago' +%Y%m%d%H`
+dt=${cur_date:0:8}
+hour=${cur_date:8:2}
+partition_date="dt='$dt',hour='$hour'"
+sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
+hive -f insert-data.hql
+src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
+tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
+hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
+hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
+hadoop fs -touchz ${src_done_path}
+hadoop fs -touchz ${tgt_done_path}
+echo "insert data [$partition_date] done"
+
+#next hours
+set +e
+while true
+do
+  sudo ./gen_demo_data.sh
+  cur_date=`date +%Y%m%d%H`
+  next_date=`date -d "+1hour" '+%Y%m%d%H'`
+  dt=${next_date:0:8}
+  hour=${next_date:8:2}
+  partition_date="dt='$dt',hour='$hour'"
+  sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
+  hive -f insert-data.hql
+  src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
+  tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
+  hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
+  hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
+  hadoop fs -touchz ${src_done_path}
+  hadoop fs -touchz ${tgt_done_path}
+  echo "insert data [$partition_date] done"
+  sleep 3600
+done
+set -e
+```
+
+Then we will load data into both two tables for every hour.
+```bash
+/apache/data/demo$ ./gen-hive-data.sh
+```
 
-You can use UI following the steps [here](../ui/user-guide.md).
+After a while, you can query demo data from hive table.
+```bash
+hive> select * from demo_src;
+124    935     935     20190226        17
+124    838     838     20190226        17
+124    631     631     20190226        17
+......
+Time taken: 2.19 seconds, Fetched: 375000 row(s)
+```
+
+See related data folder created on hdfs. 
+```bash
+$ hdfs dfs -ls /griffin/data/batch
+drwxr-xr-x   - king supergroup          0 2019-02-26 16:13 /griffin/data/batch/demo_src
+drwxr-xr-x   - king supergroup          0 2019-02-26 16:13 /griffin/data/batch/demo_tgt
+
+$ hdfs dfs -ls /griffin/data/batch/demo_src/
+drwxr-xr-x   - king supergroup          0 2019-02-26 16:14 /griffin/data/batch/demo_src/dt=20190226
+```
+
+You need to create Elasticsearch index in advance, in order to set number of shards, replicas, and other settings to desired values:
+```
+curl -k -H "Content-Type: application/json" -X PUT http://127.0.0.1:9200/griffin \
+ -d '{
+    "aliases": {},
+    "mappings": {
+        "accuracy": {
+            "properties": {
+                "name": {
+                    "fields": {
+                        "keyword": {
+                            "ignore_above": 256,
+                            "type": "keyword"
+                        }
+                    },
+                    "type": "text"
+                },
+                "tmst": {
+                    "type": "date"
+                }
+            }
+        }
+    },
+    "settings": {
+        "index": {
+            "number_of_replicas": "2",
+            "number_of_shards": "5"
+        }
+    }
+}'
+```
+You can access http://127.0.0.1:9200/griffin to verify configuration.
 
-**Note**: The UI does not support all the backend features, to experience the advanced features you can use services directly.
+Everything is ready, you can login http://127.0.0.1:8080 without username and credentials. And then create measure, job to validate data quality by [user guide](../ui/user-guide.md).
\ No newline at end of file
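
Reader's note on the namenode-format warning in the patch above: re-running the format changes the cluster ID in the name node's VERSION file, and a data node whose VERSION file still holds the old ID will not start. A minimal sketch of a pre-start check follows, assuming bash and assuming VERSION is a properties-style file with a `clusterID=` line. The function names `cluster_id` and `same_cluster_id` are hypothetical, and the `current/` subdirectory in the example paths is an assumption about the usual Hadoop layout, not something the guide states.

```bash
#!/bin/bash
# Hypothetical helpers, not part of the commit: compare the clusterID recorded
# in the name node's and data node's VERSION files before starting dfs.

# Print the clusterID value found in a VERSION file.
cluster_id() {
  sed -n 's/^clusterID=//p' "$1"
}

# Succeed only when both files are readable and hold the same, non-empty ID.
same_cluster_id() {
  local nn_id dn_id
  nn_id="$(cluster_id "$1")" || return 1
  dn_id="$(cluster_id "$2")" || return 1
  [ -n "$nn_id" ] && [ "$nn_id" = "$dn_id" ]
}

# Example call; the paths are assumptions based on the guide's data layout:
# same_cluster_id /data/hadoop-data/nn/current/VERSION \
#                 /data/hadoop-data/dn/current/VERSION \
#   || echo "clusterID mismatch: the data node will fail to start"
```

If the check fails, the guide's advice applies: make the data node's VERSION file carry the same cluster ID as the name node's, then start dfs and look in /apache/hadoop/logs/ if a daemon is still missing.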
