http://git-wip-us.apache.org/repos/asf/eagle/blob/0ecb7c1c/eagle-site/tutorial-topologymanagement.md
----------------------------------------------------------------------
diff --git a/eagle-site/tutorial-topologymanagement.md b/eagle-site/tutorial-topologymanagement.md
new file mode 100644
index 0000000..238ba3d
--- /dev/null
+++ b/eagle-site/tutorial-topologymanagement.md
@@ -0,0 +1,143 @@
---
layout: doc
title: "Topology Management"
permalink: /docs/tutorial/topologymanagement.html
---
*Since Apache Eagle 0.4.0-incubating. Apache Eagle will be called Eagle in the following.*

> The application manager aims to manage applications from the Eagle UI. Users can easily start/stop topologies remotely or locally without any shell commands. At the same time, it syncs the latest status of topologies from the execution platform (e.g., a Storm[^STORM] cluster).

This tutorial walks through the parts of the application manager and then gives an example of using it.

### Design
The application manager consists of a daemon scheduler and an execution module. The scheduler periodically loads user operations (start/stop) from the database, and the execution module executes these operations. For more details, please refer to [here](https://cwiki.apache.org/confluence/display/EAG/Application+Management).

### Configurations
The configuration file `eagle-scheduler.conf` defines scheduler parameters, execution platform settings and parts of the default topology configuration.

* **Scheduler properties**

    <style>
    table, td, th {
      border-collapse: collapse;
      border: 1px solid gray;
      padding: 10px;
    }
    </style>

    Property Name | Default | Description
    ------------- | :-------------: | -----------
    appCommandLoaderEnabled | false | whether topology management is enabled
    appCommandLoaderIntervalSecs | 1 | the interval (in seconds) at which the scheduler loads commands
    appHealthCheckIntervalSecs | 5 | the interval (in seconds) of topology health checking, which syncs the topology execution status from the Storm cluster to Eagle

* **Execution platform properties**

    Property Name | Default | Description
    ------------- | :------------- | -----------
    envContextConfig.env | storm | execution environment; only storm is supported
    envContextConfig.url | http://sandbox.hortonworks.com:8744 | Storm UI URL
    envContextConfig.nimbusHost | sandbox.hortonworks.com | Storm nimbus host
    envContextConfig.nimbusThriftPort | 6627 | Storm nimbus thrift port
    envContextConfig.jarFile | TODO | path to the Storm fat jar

* **Topology default properties**

    Some default topology properties are defined here.


### Playbook

1. Edit eagle-scheduler.conf, and start the Eagle service

        # enable application manager
        appCommandLoaderEnabled = true

        # provide jar path
        envContextConfig.jarFile =

        # storm nimbus
        envContextConfig.url = "http://sandbox.hortonworks.com:8744"
        envContextConfig.nimbusHost = "sandbox.hortonworks.com"

    For more configurations, please refer back to [Application Configuration](/docs/configuration.html). <br />
    After the configuration is ready, start the Eagle service with `bin/eagle-service.sh start`.

2. Go to the admin page

3. Go to the management page, and create a topology description.
   There are three required fields
    * name: topology name
    * type: topology type [CLASS, DYNAMIC]
    * execution entry: either a class which implements the interface TopologyExecutable, or an Eagle [DSL](https://github.com/apache/eagle/blob/master/eagle-assembly/src/main/conf/sandbox-hadoopjmx-pipeline.conf) based topology definition

4. Go back to the monitoring page, and choose the site/application to deploy the topology

5. Go to the site page, and add topology configurations.

    **NOTICE** topology configurations defined here REQUIRE an extra prefix `app.`

    Below are some example configurations for [site=sandbox, application=hbaseSecurityLog].

        classification.hbase.zookeeper.property.clientPort=2181
        classification.hbase.zookeeper.quorum=sandbox.hortonworks.com

        app.envContextConfig.env=storm
        app.envContextConfig.mode=cluster

        app.dataSourceConfig.topic=sandbox_hbase_security_log
        app.dataSourceConfig.zkConnection=sandbox.hortonworks.com:2181
        app.dataSourceConfig.zkConnectionTimeoutMS=15000
        app.dataSourceConfig.brokerZkPath=/brokers
        app.dataSourceConfig.fetchSize=1048586
        app.dataSourceConfig.transactionZKServers=sandbox.hortonworks.com
        app.dataSourceConfig.transactionZKPort=2181
        app.dataSourceConfig.transactionZKRoot=/consumers
        app.dataSourceConfig.consumerGroupId=eagle.hbasesecurity.consumer
        app.dataSourceConfig.transactionStateUpdateMS=2000
        app.dataSourceConfig.deserializerClass=org.apache.eagle.security.hbase.parse.HbaseAuditLogKafkaDeserializer

        app.eagleProps.site=sandbox
        app.eagleProps.application=hbaseSecurityLog
        app.eagleProps.dataJoinPollIntervalSec=30
        app.eagleProps.mailHost=some.mail.server
        app.eagleProps.mailSmtpPort=25
        app.eagleProps.mailDebug=true
        app.eagleProps.eagleService.host=localhost
        app.eagleProps.eagleService.port=9099
        app.eagleProps.eagleService.username=admin
        app.eagleProps.eagleService.password=secret

6. Go to the monitoring page, and start topologies

7. Stop topologies on the monitoring page


---

#### *Footnotes*

[^STORM]: *All mentions of "storm" on this page represent Apache Storm.*
http://git-wip-us.apache.org/repos/asf/eagle/blob/0ecb7c1c/eagle-site/tutorial-userprofile.md
----------------------------------------------------------------------
diff --git a/eagle-site/tutorial-userprofile.md b/eagle-site/tutorial-userprofile.md
new file mode 100644
index 0000000..e3553b2
--- /dev/null
+++ b/eagle-site/tutorial-userprofile.md
@@ -0,0 +1,65 @@
---
layout: doc
title: "User Profile Tutorial"
permalink: /docs/tutorial/userprofile.html
---
This document introduces how to start online processing of user profiles. It assumes Apache Eagle has been installed and the [Eagle service](http://sandbox.hortonworks.com:9099/eagle-service) is started.

### User Profile Offline Training

* **Step 1**: Start Apache Spark if it is not started

* **Step 2**: Start the offline scheduler

    * Option 1: command line

          $ cd <eagle-home>/bin
          $ bin/eagle-userprofile-scheduler.sh --site sandbox start

    * Option 2: start via Apache Ambari

* **Step 3**: Generate a model

### User Profile Online Detection

Two options to start the topology are provided.

* **Option 1**: command line

    Submit the userProfiles topology if it is not shown on the [topology UI](http://sandbox.hortonworks.com:8744)

        $ bin/eagle-topology.sh --main org.apache.eagle.security.userprofile.UserProfileDetectionMain --config conf/sandbox-userprofile-topology.conf start

* **Option 2**: Apache Ambari

### Evaluate User Profile in Sandbox

1. Prepare sample data for ML training and validation
    * a. Download the following sample data to be used for training
        * [`user1.hdfs-audit.2015-10-11-00.txt`](/data/user1.hdfs-audit.2015-10-11-00.txt)
        * [`user1.hdfs-audit.2015-10-11-01.txt`](/data/user1.hdfs-audit.2015-10-11-01.txt)
    * b. Download the [`userprofile-validate.txt`](/data/userprofile-validate.txt) file, which contains data points you can use to test the models
2. Copy the files (downloaded in the previous step) into a location in the sandbox, for example `/usr/hdp/current/eagle/lib/userprofile/data/`
3. Modify `<Eagle-home>/conf/sandbox-userprofile-scheduler.conf`:
update `training-audit-path` to point to the path of the training data sample (the path you used for Step 1.a);
update `detection-audit-path` to point to the path used for validation (the path you used for Step 1.b)
4. Run the ML training program from the Eagle UI
5. Produce Apache Kafka data using the contents of the validate file (Step 1.b).
Run the command (assuming the Eagle configuration uses the Kafka topic `sandbox_hdfs_audit_log`)

        ./kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 --topic sandbox_hdfs_audit_log

6. Paste a few lines of data from the validate file into kafka-console-producer.
Check [http://localhost:9099/eagle-service/#/dam/alertList](http://localhost:9099/eagle-service/#/dam/alertList) for generated alerts

http://git-wip-us.apache.org/repos/asf/eagle/blob/0ecb7c1c/eagle-site/usecases.md
----------------------------------------------------------------------
diff --git a/eagle-site/usecases.md b/eagle-site/usecases.md
new file mode 100644
index 0000000..4172ad8
--- /dev/null
+++ b/eagle-site/usecases.md
@@ -0,0 +1,48 @@
---
layout: doc
title: "Use Cases"
permalink: /docs/usecases.html
---

### Data Activity Monitoring

* Data activity represents how users explore the data provided by big data platforms. Analyzing data activity and alerting on insecure access are fundamental requirements for securing enterprise data.
As data volume increases exponentially with Hadoop[^HADOOP], Hive[^HIVE] and Spark[^SPARK] technologies, understanding data activities for every user becomes extremely hard, let alone alerting on a single malicious event in real time among petabytes of streaming data per day.

* Securing enterprise data starts from understanding data activities for every user. Apache Eagle (called Eagle in the following) has integrated with many popular big data platforms, e.g. Hadoop, Hive, Spark, Cassandra[^CASSANDRA] etc. With Eagle, users can browse the data hierarchy, mark sensitive data and then create comprehensive policies to alert on insecure data access.

### Job Performance Analytics

* Running map/reduce jobs is the most popular way people analyze data in a Hadoop system. Analyzing job performance and providing tuning suggestions are critical for Hadoop system stability, job SLA, resource usage etc.

* Eagle analyzes job performance with two complementary approaches. First, Eagle periodically takes snapshots of all running jobs with the YARN API; second, Eagle continuously reads job lifecycle events immediately after a job completes. With these two approaches, Eagle can analyze a single job's trend, data skew problems, failure reasons etc. More interestingly, Eagle can analyze a whole Hadoop cluster's performance by taking all jobs into account.

### Node Anomaly Detection

* One practical benefit of analyzing map/reduce jobs is node anomaly detection. A big data platform like Hadoop may involve thousands of nodes supporting multi-tenant jobs. One bad node may not crash the whole cluster thanks to the failure-tolerant design, but it may affect specific jobs, cause a lot of rescheduling and job delays, and hurt the stability of the whole cluster.

* Eagle developed an out-of-the-box algorithm to compare the task failure ratio of each node in a large cluster. If one node keeps failing to run tasks, it may have potential issues, for example one of its disks is full or failing. In a nutshell, if one node behaves very differently from all other nodes within one large cluster, this node is anomalous and we should take action.

### Cluster Performance Analytics

* It is critical to understand why a cluster performs badly. Is it because of some crazy jobs recently onboarded, a huge number of tiny files, or namenode performance degrading?

* Eagle calculates resource usage per minute in real time from individual jobs, e.g. CPU, memory, HDFS IO bytes, HDFS IO numOps etc., and also collects namenode JMX metrics. Correlating them will easily help the system administrator find the root cause of cluster slowness.

### Cluster Resource Usage Trend

* YARN manages resource allocation through queues in a large Hadoop cluster. Cluster resource usage is exactly reflected by overall queue usage.

* Eagle collects queue statistics in real time and provides insights into cluster resource usage.
+ + + +--- + +#### *Footnotes* + +[^HADOOP]:*All mentions of "hadoop" on this page represent Apache Hadoop.* +[^HIVE]:*All mentions of "hive" on this page represent Apache Hive.* +[^SPARK]:*All mentions of "spark" on this page represent Apache Spark.* +[^CASSANDRA]:*Apache Cassandra.* + + http://git-wip-us.apache.org/repos/asf/eagle/blob/0ecb7c1c/eagle-site/user-profile-ml.md ---------------------------------------------------------------------- diff --git a/eagle-site/user-profile-ml.md b/eagle-site/user-profile-ml.md new file mode 100644 index 0000000..5c0a6b8 --- /dev/null +++ b/eagle-site/user-profile-ml.md @@ -0,0 +1,22 @@ +--- +layout: doc +title: "User Profile Machine Learning" +permalink: /docs/user-profile-ml.html +--- + +Apache Eagle (called Eagle in the following) provides capabilities to define user activity patterns or user profiles for Apache Hadoop users based on the user behavior in the platform. The idea is to provide anomaly detection capability without setting hard thresholds in the system. The user profiles generated by our system are modeled using machine-learning algorithms and used for detection of anomalous user activities, where usersâ activity pattern differs from their pattern history. Currently Eagle uses two algorithms for anomaly detection: Eigen-Value Decomposition and Density Estimation. The algorithms read data from HDFS audit logs, slice and dice data, and generate models for each user in the system. Once models are generated, Eagle uses the Apache Storm framework for near-real-time anomaly detection to determine if current user activities are suspicious or not with respect to their model. The block diagram below shows the current pipeline for user profile training and onli ne detection. + + + +Eagle online anomaly detection uses the Eagle policy framework, and the user profile is defined as one of the policies in the system. The user profile policy is evaluated by a machine-learning evaluator extended from the Eagle policy evaluator. Policy definition includes the features that are needed for anomaly detection (same as the ones used for training purposes). + +A scheduler runs a Apache Spark based offline training program (to generate user profiles or models) at a configurable time interval; currently, the training program generates new models once every month. + +The following are some details on the algorithms. + +* **Density Estimation**: In this algorithm, the idea is to evaluate, for each user, a probability density function from the observed training data sample. We mean-normalize a training dataset for each feature. Normalization allows datasets to be on the same scale. In our probability density estimation, we use a Gaussian distribution function as the method for computing probability density. Features are conditionally independent of one another; therefore, the final Gaussian probability density can be computed by factorizing each featureâs probability density. During the online detection phase, we compute the probability of a userâs activity. If the probability of the user performing the activity is below threshold (determined from the training program, using a method called Mathews Correlation Coefficient), we signal anomaly alerts. +* **Eigen-Value Decomposition**: Our goal in user profile generation is to find interesting behavioral patterns for users. One way to achieve that goal is to consider a combination of features and see how each one influences the others. 
When the data volume is large, which is generally the case for us, abnormal patterns among features may go unnoticed due to the huge number of normal patterns. As normal behavioral patterns tend to lie within a very low-dimensional subspace, we can potentially reduce the dimension of the dataset to better understand the user behavior pattern. This method also reduces noise, if any, in the training dataset. Based on the amount of variance of the data we maintain for a user, which is usually 95% in our case, we seek to find the number of principal components k that represents 95% of the variance. We consider the first k principal components as the normal subspace for the user. The remaining (n-k) principal components are considered the abnormal subspace.

During online anomaly detection, if the user behavior lies near the normal subspace, we consider the behavior to be normal. On the other hand, if the user behavior lies near the abnormal subspace, we raise an alarm, as we believe usual user behavior should generally fall within the normal subspace. We use the Euclidean distance method to compute whether a user's current activity is near the normal or abnormal subspace.
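To make the density-estimation description above more concrete, here is a minimal sketch of the idea, not Eagle's actual implementation: per-feature Gaussians are fit on mean-normalized training data, and an activity is flagged when the factorized density falls below a threshold. The feature dimensions, sample data and threshold value are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of the density-estimation idea; not Eagle's code.

def fit_gaussians(X):
    """X: (n_samples, n_features) training activities of a single user."""
    mu = X.mean(axis=0)             # per-feature mean used for normalization
    sigma2 = (X - mu).var(axis=0) + 1e-9   # per-feature variance (avoid divide-by-zero)
    return mu, sigma2

def activity_density(x, mu, sigma2):
    """Factorized Gaussian density, assuming conditionally independent features."""
    xn = x - mu
    per_feature = np.exp(-xn**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return float(np.prod(per_feature))

# Example usage with made-up numbers: 3 features per activity record.
train = np.random.rand(1000, 3)     # stand-in for one user's historical activity
mu, sigma2 = fit_gaussians(train)
threshold = 1e-4                    # illustrative; Eagle derives this via the
                                    # Matthews Correlation Coefficient
if activity_density(np.array([0.9, 0.1, 0.95]), mu, sigma2) < threshold:
    print("anomaly alert")
```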
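Along the same lines, the following sketch illustrates the eigen-decomposition approach under the assumptions stated in the text: keep the top-k principal components covering 95% of the variance as the normal subspace, and score a new activity by the Euclidean distance of its residual (the part lying in the abnormal subspace). Variable names and the alert threshold are illustrative, not Eagle's API.

```python
import numpy as np

# Illustrative sketch of PCA-based normal/abnormal subspaces; not Eagle's code.

def normal_subspace(X, variance_kept=0.95):
    """Return the feature mean and the top-k principal directions
    covering `variance_kept` of the variance."""
    mu = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)  # rows of Vt are principal directions
    explained = (s**2) / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(explained), variance_kept)) + 1
    return mu, Vt[:k]               # (k, n_features) basis of the normal subspace

def abnormal_distance(x, mu, basis):
    """Euclidean distance from x to the normal subspace, i.e. the norm of
    the residual that lies in the abnormal (n-k) subspace."""
    xn = x - mu
    projection = basis.T @ (basis @ xn)   # component inside the normal subspace
    return float(np.linalg.norm(xn - projection))

# Example usage with made-up numbers.
train = np.random.rand(1000, 5)
mu, basis = normal_subspace(train)
if abnormal_distance(np.random.rand(5), mu, basis) > 1.0:   # threshold is illustrative
    print("anomaly alert")
```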
