http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/contents/introduction/gearpump-internals.md ---------------------------------------------------------------------- diff --git a/docs/contents/introduction/gearpump-internals.md b/docs/contents/introduction/gearpump-internals.md new file mode 100644 index 0000000..bc7e6bf --- /dev/null +++ b/docs/contents/introduction/gearpump-internals.md @@ -0,0 +1,228 @@

### Actor Hierarchy

Everything in the diagram is an actor; they fall into two categories, Cluster Actors and Application Actors.

#### Cluster Actors

**Worker**: Maps to a physical worker machine. It is responsible for managing resources and reporting metrics on that machine.

**Master**: Heart of the cluster; it manages workers, resources, and applications. The main functions are delegated to three child actors: App Manager, Worker Manager, and Resource Scheduler.

#### Application Actors

**AppMaster**: Responsible for scheduling tasks to workers and managing the state of the application. Different applications have different AppMaster instances and are isolated from each other.

**Executor**: Child of AppMaster; represents a JVM process. Its job is to manage the life cycle of tasks and recover them in case of failure.

**Task**: Child of Executor; does the real work. Every task actor has a globally unique address, and one task actor can send data to any other task actor. This gives us great flexibility in how the computation DAG is distributed.

All actors in the graph are woven together with actor supervision and actor watching, and every error is handled properly via supervisors. In a master, risky jobs are isolated and delegated to child actors, making it more robust. In the application, an extra intermediate layer, "Executor", is created so that we can do fine-grained and fast recovery in case of task failure.
A master watches the lifecycle of AppMasters and workers to handle failures, but the life cycles of Worker and AppMaster are not bound to a Master Actor by supervision, so that a Master node can fail independently. Several Master Actors form an Akka cluster, and the Master state is exchanged using the Gossip protocol in a conflict-free, consistent way, so that there is no single point of failure. With this hierarchy design, we are able to achieve high availability.

### Application Clock and Global Clock Service

The global clock service tracks the minimum timestamp of all pending messages in the system. Every task reports its own minimum clock to the global clock service; the minimum clock of a task is decided by the minimum of:

- Minimum timestamp of all pending messages in the inbox.
- Minimum timestamp of all un-acked outgoing messages. When there is message loss, the minimum clock will not advance.
- Minimum clock of all task states. If the state is accumulated from a lot of input messages, then the clock value is decided by the oldest message's timestamp. The state clock advances by taking snapshots to persistent storage or by fading out the effect of old messages.

The global clock service keeps track of all task minimum clocks efficiently and maintains a global view of the minimum clock. The global minimum clock value is monotonically increasing; it means that all source messages before this clock value have been processed. If there is message loss or a task crash, the global minimum clock will stop advancing.

### How do we optimize the message passing performance?

For a streaming application, message passing performance is extremely important. For example, one streaming platform may need to process millions of messages per second with millisecond-level latency. High throughput and low latency are not easy to achieve.
There are a number of challenges:

#### First Challenge: Network is not efficient for small messages

In streaming, the typical message size is very small, usually less than 100 bytes per message, like floating car GPS data. But network efficiency is very poor when transferring small messages. As you can see in the diagram below, when the message size is 50 bytes, only about 20% of the bandwidth can be used. How do we improve the throughput?

#### Second Challenge: Message overhead is too big

Each message sent between two actors contains the sender and receiver actor paths. When sending over the wire, the overhead of this ActorPath is not trivial. For example, the actor path below takes more than 200 bytes.

    :::bash
    akka.tcp://[email protected]:51582/remote/akka.tcp/[email protected]:48948/remote/akka.tcp/[email protected]:43676/user/master/Worker1/app_0_executor_0/group_1_task_0#-768886794

#### How do we solve this?

We implement a custom Netty transportation layer as an Akka extension. In the diagram below, the Netty Client translates the ActorPath to a TaskId, and the Netty Server translates it back. Only the TaskId is passed on the wire; it is only about 10 bytes, so the overhead is minimized. Different Netty Client Actors are isolated; they will not block each other.

For performance, effective batching is really the key! We group multiple messages into a single batch and send it on the wire. The batch size is not fixed; it is adjusted dynamically based on network status. If the network is available, we flush pending messages immediately without waiting; otherwise we put the message in a batch and trigger a timer to flush the batch later.

### How do we do flow control?

Without flow control, one task can easily flood another task with too many messages, causing an out-of-memory error. Typical flow control uses a TCP-like sliding window, so that source and target can run concurrently without blocking each other.
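A TCP-like sliding window per output channel can be sketched as follows. This is a simplified illustration under our own naming, not Gearpump's actual implementation: at most `window` unacked messages may be in flight, and the source is back-pressured once the window fills.

```python
class OutputChannel:
    """Sliding-window flow control: at most `window` unacked messages in flight."""

    def __init__(self, window):
        self.window = window
        self.sent = 0    # messages sent so far
        self.acked = 0   # messages acknowledged by the receiver (cumulative)

    def can_send(self):
        return self.sent - self.acked < self.window

    def send(self, msg, transport):
        if not self.can_send():
            return False  # window full: caller must buffer; source is back-pressured
        transport(msg)
        self.sent += 1
        return True

    def on_ack(self, ack_count):
        # The receiver acks cumulatively: the total number of messages received.
        self.acked = max(self.acked, ack_count)
```

Because the window caps in-flight messages per channel, a slow downstream task naturally throttles its upstream without either side blocking a thread.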
Figure: Flow control, each task is "star" connected to input tasks and output tasks

The difficult part of our problem is that each task can have multiple input tasks and output tasks. The input and output must be geared together so that back pressure can be properly propagated from downstream to upstream. The flow control also needs to consider failures, and it needs to be able to recover when there is message loss.

Another challenge is that the overhead of flow control messages can be big. If we ack every message, there will be a huge amount of ack messages in the system, degrading streaming performance. The approach we adopted is to use an explicit AckRequest message. The target tasks only ack back when they receive an AckRequest message, and the source only sends an AckRequest when necessary. With this approach, we can largely reduce the overhead.

### How do we detect message loss?

For example, for web ads we may charge for every click, so we don't want to miscount. The streaming platform needs to effectively track which messages have been lost, and recover as fast as possible.

Figure: Message Loss Detection

We use the flow control messages AckRequest and Ack to detect message loss. The target task counts how many messages have been received since the last AckRequest, and acks the count back to the source task. The source task checks the count and detects message loss.

This is just an illustration; the real case is more difficult, as we need to handle zombie tasks and in-flight stale messages.

### How does Gearpump know what messages to replay?

In some applications, a message cannot be lost and must be replayed. For example, during a money transfer, the bank SMSes us the verification code. If that message is lost, the system must replay it so that the money transfer can continue. We made the decision to use **source end message storage** and **time stamp based replay**.
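The idea of a source-end message store with timestamp-based replay can be sketched as follows. This is a minimal illustration with hypothetical names, not Gearpump's actual implementation; it assumes messages are appended in approximately increasing timestamp order.

```python
class SourceEndMessageStore:
    """Source-end store: keeps sent messages so they can be replayed by timestamp."""

    def __init__(self):
        self.messages = []  # (timestamp, payload) pairs, approximately time-ordered

    def append(self, timestamp, payload):
        self.messages.append((timestamp, payload))

    def replay_from(self, timestamp):
        # A timestamp filter drops messages older than the requested point,
        # since the store may not be strictly time-ordered.
        return [m for m in self.messages if m[0] >= timestamp]
```

During recovery, the source would call `replay_from` with the recovered global minimum clock and re-send everything from that point downstream.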
Figure: Replay with Source End Message Store

Every message is immutable and tagged with a timestamp. We assume that timestamps are approximately incremental (a small ratio of message disorder is allowed).

We assume messages come from a replayable source, like a Kafka queue; otherwise messages are stored in a customizable source-end "message store". When the source task sends a message downstream, the timestamp and offset of the message are also checkpointed to offset-timestamp storage periodically. During recovery, the system first retrieves the right timestamp and offset from the offset-timestamp storage, then replays the message store from that timestamp and offset. A Timestamp Filter filters out old messages in case the messages in the message store are not strictly time-ordered.

### Master High Availability

In a distributed streaming system, any part can fail. The system must stay responsive and recover in case of errors.

Figure: Master High Availability

We use Akka clustering to implement Master high availability. The cluster consists of several master nodes, but no worker nodes. With clustering facilities, we can easily detect and handle master node crashes. The master state is replicated on all master nodes with the Typesafe akka-data-replication library; when one master node crashes, a standby master reads the master state and takes over. The master state contains the submission data of all applications. If one application dies, a master can use that state to recover the application. The CRDT LwwMap is used to represent the state; it is a hash map that can converge on distributed nodes without conflict. To have strong data consistency, state reads and writes must happen on a quorum of master nodes.

### How do we handle failures?

With Akka's powerful actor supervision, we can implement a resilient system with relative ease.
In Gearpump, each application has its own AppMaster instance, and applications are totally isolated from each other. For each application, there is a supervision tree, AppMaster->Executor->Task. With this supervision hierarchy, we can free ourselves from the headache of zombie processes; for example, if an AppMaster is down, the Akka supervisor ensures the whole tree is shut down.

There are multiple possible failure scenarios:

Figure: Possible Failure Scenarios and Error Supervision Hierarchy

#### What happens when the Master crashes?

In case of a master crash, other standby masters will be notified; they will restore the master state and take over control. Workers and AppMasters will also be notified; they will trigger a process to find the new active master until the resolution completes. If an AppMaster or Worker cannot resolve a new Master within a timeout, it will kill itself.

#### What happens when a worker crashes?

In case of a worker crash, the Master will get notified and stop scheduling new computation to this worker. All supervised executors on the current worker will be killed; the AppMaster can treat this as an executor crash, as described in [What happens when an executor crashes?](#what-happens-when-an-executor-crashes)

#### What happens when the AppMaster crashes?

If an AppMaster crashes, the Master will schedule new resources to create a new AppMaster instance elsewhere, and then the AppMaster will handle the recovery inside the application. For streaming, it will recover the latest minimum clock and other state from disk, request resources from the master to start executors, and restart the tasks with the recovered minimum clock.

#### What happens when an executor crashes?

If an executor crashes, its supervisor AppMaster will get notified and request a new resource from the active master to start a new executor, which will run the tasks that were located on the crashed executor.

#### What happens when tasks crash?
If a task throws an exception, its supervising executor will restart that task.

When "at least once" message delivery is enabled, message replay is triggered in the case of message loss. First, the AppMaster reads the latest minimum clock from the global clock service (or from clock storage if the clock service has crashed); then the AppMaster restarts all the task actors to get a fresh task state; then the source-end tasks replay messages from that minimum clock.

### How does "exactly-once" message delivery work?

For some applications, it is extremely important to have "exactly once" message delivery. For example, for a real-time billing system, we don't want to bill the customer twice. The goals of "exactly once" message delivery are to make sure that:

- Errors don't accumulate; today's error will not be carried over to tomorrow.
- It is transparent to the application developer.

We use the global clock to synchronize the distributed transactions. We assume every message from the data source has a unique timestamp; the timestamp can be part of the message body, or can be attached later with the system clock when the message is injected into the streaming system. With this globally synchronized clock, we can coordinate all tasks to checkpoint at the same timestamp.

Figure: Checkpointing and Exactly-Once Message delivery

Workflow to do state checkpointing:

1. The coordinator asks the streaming system to do a checkpoint at timestamp Tc.
2. Each application task maintains two states, checkpoint state and current state. The checkpoint state only contains information before timestamp Tc. The current state contains all information.
3. When the global minimum clock is larger than Tc, it means all messages older than Tc have been processed; the checkpoint state will no longer change, so we can safely persist the checkpoint state to storage.
4. When there is message loss, we start the recovery process.
5.
To recover, load the latest checkpoint state from the store, and then use it to restore the application status.
6. The data source replays messages from the checkpoint timestamp.

The checkpoint interval is determined dynamically by the global clock service. Each data source tracks the max timestamp of input messages. Upon receiving min clock updates, the data source reports the time delta back to the global clock service. The max time delta is the upper bound of the application state timespan. The checkpoint interval is bigger than the max delta time:

Figure: How to determine Checkpoint Interval

After the global clock service notifies tasks of the checkpoint interval, each task calculates its next checkpoint timestamp autonomously, without global synchronization.

Each task contains two states, checkpoint state and current state. The code to update the states is shown in the listing below.

    :::python
    TaskState(stateStore, initialTimestamp):
      currentState = stateStore.load(initialTimestamp)
      checkpointState = currentState.clone
      checkpointTimestamp = nextCheckpointTimestamp(initialTimestamp)
      maxClock = initialTimestamp

    onMessage(msg):
      if msg.timestamp < checkpointTimestamp:
        checkpointState.updateMessage(msg)
      currentState.updateMessage(msg)
      maxClock = max(maxClock, msg.timestamp)

    onMinClock(minClock):
      if minClock > checkpointTimestamp:
        stateStore.persist(checkpointState)
        checkpointTimestamp = nextCheckpointTimestamp(maxClock)
        checkpointState = currentState.clone

    onNewCheckpointInterval(newStep):
      step = newStep

    nextCheckpointTimestamp(timestamp):
      return (1 + timestamp / step) * step

List 1: Task Transactional State Implementation

### What is a dynamic graph, and how does it work?

The DAG can be modified dynamically. We want to be able to dynamically add, remove, and replace a sub-graph.
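A sub-graph replacement can be sketched as follows. This is a toy illustration under an assumed adjacency-set representation of the DAG, not Gearpump's actual graph API: replacing a processor means moving its outgoing edges to the new processor and rewiring every edge that pointed at the old one.

```python
def replace_processor(dag, old, new):
    """Replace processor `old` with `new` in a DAG represented as
    {processor_name: set of downstream processor names}."""
    # The new processor inherits the old processor's outgoing edges.
    dag[new] = dag.pop(old)
    # Rewire every incoming edge from the old processor to the new one.
    for downstream in dag.values():
        if old in downstream:
            downstream.discard(old)
            downstream.add(new)
    return dag
```

Attach and remove operations would be analogous: attach adds a node plus its edges, and remove deletes a node and rewires or drops the edges around it.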
Figure: Dynamic Graph, Attach, Replace, and Remove

## At least once message delivery and Kafka

The Kafka source example project and tutorials can be found at:

- [Kafka connector example project](https://github.com/apache/incubator-gearpump/tree/master/examples/streaming/kafka)
- [Connect with Kafka source](../dev/dev-connectors)

In this doc, we will talk about how at least once message delivery works.

We will use the WordCount example from the [source tree](https://github.com/apache/incubator-gearpump/tree/master/examples/streaming/kafka) to illustrate.

### What the Kafka WordCount DAG looks like

It contains three processors:

- KafkaStreamProducer (or KafkaSource) reads messages from the Kafka queue.
- Split splits lines into words.
- Sum counts the occurrences of each word.

### How to read data from Kafka

We use KafkaSource; please check [Connect with Kafka source](../dev/dev-connectors) for an introduction.

Please note that we have set a startTimestamp for the KafkaSource, which means KafkaSource will read from the Kafka queue starting from messages whose timestamp is near startTimestamp.

### What happens when there is a task crash or message loss?

When there is message loss, the AppMaster first pauses the global clock service so that the global minimum timestamp no longer changes, then it restarts the Kafka source tasks. Upon restart, KafkaSource starts to replay. It first reads the global minimum timestamp from the AppMaster, and then starts to read messages from that timestamp.

### How does KafkaSource read messages from a start timestamp, given that the Kafka queue doesn't expose timestamp information?

The Kafka queue only exposes offset information for each partition. KafkaSource maintains its own mapping from Kafka offset to application timestamp, so that it can map an application timestamp to a Kafka offset and replay Kafka messages from that offset.
The mapping between application timestamps and Kafka offsets is stored in a distributed file system or as a Kafka topic.
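The offset lookup can be sketched as follows. This is a simplified, hypothetical illustration (not the real KafkaSource code): it records (timestamp, offset) checkpoints per partition and, given a start timestamp, finds the offset of the first checkpoint at or after it.

```python
import bisect

class OffsetTimestampStore:
    """Maps application timestamps to Kafka offsets so replay can start
    from the offset of the first message at or after a given timestamp."""

    def __init__(self):
        self.timestamps = []  # checkpointed timestamps, assumed sorted
        self.offsets = []     # Kafka offset recorded at each timestamp

    def record(self, timestamp, offset):
        self.timestamps.append(timestamp)
        self.offsets.append(offset)

    def offset_for(self, timestamp):
        # Binary search for the first checkpoint at or after `timestamp`.
        i = bisect.bisect_left(self.timestamps, timestamp)
        return self.offsets[i] if i < len(self.offsets) else None
```

In the real system this store would be persisted (e.g. to a distributed file system or a Kafka topic, as the text notes), so the mapping survives task restarts.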
http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/contents/introduction/message-delivery.md ---------------------------------------------------------------------- diff --git a/docs/contents/introduction/message-delivery.md b/docs/contents/introduction/message-delivery.md new file mode 100644 index 0000000..1ffeb9e --- /dev/null +++ b/docs/contents/introduction/message-delivery.md @@ -0,0 +1,47 @@

## What is At Least Once Message Delivery?

Messages can be lost on delivery due to network partitions. **At Least Once Message Delivery** (at least once) means the lost messages are delivered one or more times, such that at least one is processed and acknowledged by the whole flow.

Gearpump guarantees at least once for any source that is able to replay messages from a past timestamp. In Gearpump, each message is tagged with a timestamp, and the system tracks the minimum timestamp of all pending messages (the global minimum clock). On message loss, the application will be restarted from the global minimum clock. Since the source is able to replay from the global minimum clock, all pending messages from before the restart will be replayed. Gearpump calls that kind of source a `TimeReplayableSource` and already provides a built-in [KafkaSource](gearpump-internals#at-least-once-message-delivery-and-kafka). With the KafkaSource to ingest data into Gearpump, users are guaranteed at least once message delivery.

## What is Exactly Once Message Delivery?

At least once delivery doesn't guarantee the correctness of the application result. For instance, for a task keeping a count of received messages, duplicated messages cause an overcount, and the count is lost on task failure. In that case, **Exactly Once Message Delivery** (exactly once) is required, where state is updated by a message exactly once. This further requires that duplicated messages are filtered out and in-memory states are persisted.
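The requirement can be sketched as follows. This toy example (our own naming, not Gearpump's API) shows why checkpointing state together with its clock gives exactly-once updates: on failure, the state is rolled back to the checkpoint, and the source replays from the checkpoint clock, so no message is counted twice.

```python
class CheckpointedCount:
    """Counting state checkpointed together with its clock, so that
    recovery plus replay-from-checkpoint updates state exactly once."""

    def __init__(self):
        self.count = 0
        self.clock = 0            # max timestamp seen so far
        self._checkpoint = (0, 0)  # persisted (count, clock)

    def on_message(self, timestamp):
        self.count += 1
        self.clock = max(self.clock, timestamp)

    def checkpoint(self):
        # In a real system this pair would go to a persistent store.
        self._checkpoint = (self.count, self.clock)

    def recover(self):
        # Restore state at the checkpoint clock; the source then replays
        # messages after that clock, so nothing is double-counted.
        self.count, self.clock = self._checkpoint
        return self.clock
```

Without the checkpointed clock, a replay after failure would re-apply already-counted messages, which is exactly the overcount problem described above.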
Users are guaranteed exactly once in Gearpump if they use both a `TimeReplayableSource` to ingest data and the Persistent API to manage their in-memory states. With the Persistent API, user state is periodically checkpointed by the system to a persistent store (e.g. HDFS) along with its checkpoint time. Gearpump tracks the global minimum checkpoint timestamp of all pending states (the global minimum checkpoint clock), which is persisted as well. On application restart, the system restores states at the global minimum checkpoint clock and the source replays messages from that clock. This ensures that a message updates all states exactly once.

### Persistent API

The Persistent API consists of `PersistentTask` and `PersistentState`.

Here is an example of using them to keep a count of incoming messages.

    :::scala
    class CountProcessor(taskContext: TaskContext, conf: UserConfig)
      extends PersistentTask[Long](taskContext, conf) {

      override def persistentState: PersistentState[Long] = {
        import com.twitter.algebird.Monoid.longMonoid
        new NonWindowState[Long](new AlgebirdMonoid(longMonoid), new ChillSerializer[Long])
      }

      override def processMessage(state: PersistentState[Long], message: Message): Unit = {
        state.update(message.timestamp, 1L)
      }
    }

The `CountProcessor` creates a customized `PersistentState`, which will be managed by `PersistentTask`, and overrides the `processMessage` method to define how the state is updated on a new message (each new message counts as `1`, which is added to the existing value).

Gearpump already offers two types of state:

1. NonWindowState - state with no time or other boundary
2. WindowState - each state is bounded by a time window

They are intended for states that satisfy monoid laws:

1. has a binary associative operation, like `+`
2.
has an identity element, like `0`

In the above example, we make use of the `longMonoid` from [Twitter's Algebird](https://github.com/twitter/algebird) library, which provides a bunch of useful monoids. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/contents/introduction/performance-report.md ---------------------------------------------------------------------- diff --git a/docs/contents/introduction/performance-report.md b/docs/contents/introduction/performance-report.md new file mode 100644 index 0000000..e4254d1 --- /dev/null +++ b/docs/contents/introduction/performance-report.md @@ -0,0 +1,34 @@

# Performance Evaluation

To illustrate the performance of Gearpump, we mainly focus on two aspects, throughput and latency, using a micro-benchmark called SOL (an example in the Gearpump package) whose topology is quite simple: SOLStreamProducer delivers messages to SOLStreamProcessor constantly, and SOLStreamProcessor does nothing. We set up a 4-node cluster with a 10GbE network; each node's hardware is briefly shown as follows:

Processor: 32 core Intel(R) Xeon(R) CPU E5-2690 2.90GHz
Memory: 64GB

## Throughput

We tried to explore the upper bound of the throughput. After launching 48 SOLStreamProducers and 48 SOLStreamProcessors, the figure below shows that the whole throughput of the cluster can reach about 18 million messages/second (100 bytes per message).

## Latency

When we transfer messages at the max throughput above, the average latency between two tasks is 8ms.

## Fault Recovery time

When a failure is detected, for example when an Executor is down, Gearpump reallocates the resources and restarts the application. It takes about 10 seconds to recover the application.

## How to set up the benchmark environment?

### Prepare the env

1). Set up a 4-node Gearpump cluster with a 10GbE network and 4 Workers on each node.
In our test environment, each node has 64GB memory and a 32-core Intel(R) Xeon(R) E5-2690 2.90GHz processor. Make sure metrics are enabled in Gearpump.

2). Submit a SOL application with 48 StreamProducers and 48 StreamProcessors:

    :::bash
    bin/gear app -jar ./examples/sol-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}-assembly.jar -streamProducer 48 -streamProcessor 48

3). Launch Gearpump's dashboard, browse to http://$HOST:8090/, and switch to the Applications tab to see the detailed information of your application. HOST should be the node that runs the dashboard. http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/contents/introduction/submit-your-1st-application.md ---------------------------------------------------------------------- diff --git a/docs/contents/introduction/submit-your-1st-application.md b/docs/contents/introduction/submit-your-1st-application.md new file mode 100644 index 0000000..21cfaf2 --- /dev/null +++ b/docs/contents/introduction/submit-your-1st-application.md @@ -0,0 +1,39 @@

Before you can submit and run your first Gearpump application, you need a running Gearpump service. There are multiple ways to run Gearpump: [Local mode](../deployment/deployment-local), [Standalone mode](../deployment/deployment-standalone), [YARN mode](../deployment/deployment-yarn) or [Docker mode](../deployment/deployment-docker).

The easiest way is to run Gearpump in [Local mode](../deployment/deployment-local). Any Linux, MacOSX or Windows desktop can be used with zero configuration.

In the example below, we assume you are running in [Local mode](../deployment/deployment-local). If you are running Gearpump in one of the other modes, you will need to configure the Gearpump client to connect to the Gearpump service by setting the `gear.conf` configuration path in the classpath. Within this file, you will need to change the parameter `gearpump.cluster.masters` to the correct Gearpump master(s).
See [Configuration](../deployment/deployment-configuration) for details.

## Steps to submit your first Application

### Step 1: Submit application

After the cluster is started, you can submit an example wordcount application to the cluster.

Open another shell:

    :::bash
    ### To run WordCount example
    bin/gear app -jar examples/wordcount-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}-assembly.jar org.apache.gearpump.streaming.examples.wordcount.WordCount

### Step 2: Congratulations, you've submitted your first application.

To view the application status and metrics, start the Web UI services, and browse to [http://127.0.0.1:8090](http://127.0.0.1:8090) to check the status. The default username and password are "admin:admin"; you can check [UI Authentication](../deployment/deployment-ui-authentication) to find out how to manage users.

**NOTE:** the UI port setting can be defined in the configuration; please check the [Configuration](../deployment/deployment-configuration) section.

## A quick Look at the Web UI
TBD

## Other Application Examples

Besides wordcount, there are several other example applications. Please check the source tree examples/ for detailed information.
http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/api/java.md ---------------------------------------------------------------------- diff --git a/docs/docs/api/java.md b/docs/docs/api/java.md deleted file mode 100644 index 3b94f91..0000000 --- a/docs/docs/api/java.md +++ /dev/null @@ -1 +0,0 @@ -Placeholder http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/api/scala.md ---------------------------------------------------------------------- diff --git a/docs/docs/api/scala.md b/docs/docs/api/scala.md deleted file mode 100644 index 3b94f91..0000000 --- a/docs/docs/api/scala.md +++ /dev/null @@ -1 +0,0 @@ -Placeholder http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/deployment/deployment-configuration.md ---------------------------------------------------------------------- diff --git a/docs/docs/deployment/deployment-configuration.md b/docs/docs/deployment/deployment-configuration.md deleted file mode 100644 index 1dadbd7..0000000 --- a/docs/docs/deployment/deployment-configuration.md +++ /dev/null @@ -1,84 +0,0 @@ -## Master and Worker configuration - -Master and Worker daemons will only read configuration from `conf/gear.conf`. - -Master reads configuration from section master and gearpump: - - :::bash - master { - } - gearpump{ - } - - -Worker reads configuration from section worker and gearpump: - - :::bash - worker { - } - gearpump{ - } - - -## Configuration for user submitted application job - -For user application job, it will read configuration file `gear.conf` and `application.conf` from classpath, while `application.conf` has higher priority. -The default classpath contains: - -1. `conf/` -2. current working directory. - -For example, you can put a `application.conf` on your working directory, and then it will be effective when you submit a new job application. - -## Logging - -To change the log level, you need to change both `gear.conf`, and `log4j.properties`. 
- -### To change the log level for master and worker daemon - -Please change `log4j.rootLevel` in `log4j.properties`, `gearpump-master.akka.loglevel` and `gearpump-worker.akka.loglevel` in `gear.conf`. - -### To change the log level for application job - -Please change `log4j.rootLevel` in `log4j.properties`, and `akka.loglevel` in `gear.conf` or `application.conf`. - -## Gearpump Default Configuration - -This is the default configuration for `gear.conf`. - -| config item | default value | description | -| -------------- | -------------- | ---------------- | -| gearpump.hostname | "127.0.0.1" | hostname of current machine. If you are using local mode, then set this to 127.0.0.1. If you are using cluster mode, make sure this hostname can be accessed by other machines. | -| gearpump.cluster.masters | ["127.0.0.1:3000"] | Config to set the master nodes of the cluster. If there are multiple master in the list, then the master nodes runs in HA mode. For example, you may start three master, on node1: `bin/master -ip node1 -port 3000`, on node2: `bin/master -ip node2 -port 3000`, on node3: `bin/master -ip node3 -port 3000`, then you need to set `gearpump.cluster.masters = ["node1:3000","node2:3000","node3:3000"]` | -| gearpump.task-dispatcher | "gearpump.shared-thread-pool-dispatcher" | default dispatcher for task actor | -| gearpump.metrics.enabled | true | flag to enable the metrics system | -| gearpump.metrics.sample-rate | 1 | We will take one sample every `gearpump.metrics.sample-rate` data points. Note it may have impact that the statistics on UI portal is not accurate. Change it to 1 if you want accurate metrics in UI | -| gearpump.metrics.report-interval-ms | 15000 | we will report once every 15 seconds | -| gearpump.metrics.reporter | "akka" | available value: "graphite", "akka", "logfile" which write the metrics data to different places. 
| -| gearpump.retainHistoryData.hours | 72 | max hours of history data to retain, Note: Due to implementation limitation(we store all history in memory), please don't set this to too big which may exhaust memory. | -| gearpump.retainHistoryData.intervalMs | 3600000 | time interval between two data points for history data (unit: ms). Usually this is set to a big value so that we only store coarse-grain data | -| gearpump.retainRecentData.seconds | 300 | max seconds of recent data to retain. This is for the fine-grain data | -| gearpump.retainRecentData.intervalMs | 15000 | time interval between two data points for recent data (unit: ms) | -| gearpump.log.daemon.dir | "logs" | The log directory for daemon processes(relative to current working directory) | -| gearpump.log.application.dir | "logs" | The log directory for applications(relative to current working directory) | -| gearpump.serializers | a map | custom serializer for streaming application, e.g. `"scala.Array" = ""` | -| gearpump.worker.slots | 1000 | How many slots each worker contains | -| gearpump.appmaster.vmargs | "-server -Xss1M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseParNewGC -XX:NewRatio=3 -Djava.rmi.server.hostname=localhost" | JVM arguments for AppMaster | -| gearpump.appmaster.extraClasspath | "" | JVM default class path for AppMaster | -| gearpump.executor.vmargs | "-server -Xss1M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseParNewGC -XX:NewRatio=3 -Djava.rmi.server.hostname=localhost" | JVM arguments for executor | -| gearpump.executor.extraClasspath | "" | JVM default class path for executor | -| gearpump.jarstore.rootpath | "jarstore/" | Define where the submitted jar file will be stored. This path follows the hadoop path schema. For HDFS, use `hdfs://host:port/path/`, and HDFS HA, `hdfs://namespace/path/`; if you want to store on master nodes, then use local directory. 
`jarstore.rootpath = "jarstore/"` will point to a directory relative to where the master is started. `jarstore.rootpath = "/jarstore/"` will point to an absolute directory on the master server | -| gearpump.scheduling.scheduler-class | "org.apache.gearpump.cluster.scheduler.PriorityScheduler" | Class used to schedule the applications. | -| gearpump.services.host | "127.0.0.1" | dashboard UI host address | -| gearpump.services.port | 8090 | dashboard UI host port | -| gearpump.netty.buffer-size | 5242880 | netty connection buffer size | -| gearpump.netty.max-retries | 30 | maximum number of retries for a netty client to connect to a remote host | -| gearpump.netty.base-sleep-ms | 100 | base sleep time for a netty client to retry a connection. The actual sleep time is a multiple of this value | -| gearpump.netty.max-sleep-ms | 1000 | maximum sleep time for a netty client to retry a connection | -| gearpump.netty.message-batch-size | 262144 | netty max batch size | -| gearpump.netty.flush-check-interval | 10 | max flush interval for the netty layer, in milliseconds | -| gearpump.netty.dispatcher | "gearpump.shared-thread-pool-dispatcher" | default dispatcher for netty client and server | -| gearpump.shared-thread-pool-dispatcher | default Dispatcher with "fork-join-executor" | default shared thread pool dispatcher | -| gearpump.single-thread-dispatcher | PinnedDispatcher | default single thread dispatcher | -| gearpump.serialization-framework | "org.apache.gearpump.serializer.FastKryoSerializationFramework" | Gearpump has a built-in serialization framework using Kryo. Users are allowed to use a different serialization framework, like Protobuf. 
See `org.apache.gearpump.serializer.FastKryoSerializationFramework` for how a custom serialization framework can be defined | -| worker.executor-share-same-jvm-as-worker | false | whether the executor actor is started in the same JVM (process) as the worker actor. This setting is intended for the convenience of single-machine debugging; however, the app jar needs to be added to the worker's classpath when you set it to true and have a 'real' worker in the cluster | \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/deployment/deployment-docker.md ---------------------------------------------------------------------- diff --git a/docs/docs/deployment/deployment-docker.md b/docs/docs/deployment/deployment-docker.md deleted file mode 100644 index c71ed9d..0000000 --- a/docs/docs/deployment/deployment-docker.md +++ /dev/null @@ -1,5 +0,0 @@ -## Gearpump Docker Container - -There is a pre-built Docker container available at [Docker Repo](https://hub.docker.com/r/gearpump/gearpump/) - -Check the documentation there to learn how to launch a Gearpump cluster in one line. http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/deployment/deployment-ha.md ---------------------------------------------------------------------- diff --git a/docs/docs/deployment/deployment-ha.md b/docs/docs/deployment/deployment-ha.md deleted file mode 100644 index 9e907c0..0000000 --- a/docs/docs/deployment/deployment-ha.md +++ /dev/null @@ -1,75 +0,0 @@ -To support HA, we allow starting masters on multiple nodes. They form a quorum to decide consistency. For example, if we start masters on 5 nodes and 2 nodes go down, the cluster is still consistent and functional. - -Here are the steps to enable HA mode: - -### 1. Configure. - -#### Select master machines - -Distribute the package to all nodes. Modify `conf/gear.conf` on all nodes. 
You MUST configure - - :::bash - gearpump.hostname - -to make it point to your hostname (or IP), and - - :::bash - gearpump.cluster.masters - -to a list of master nodes. For example, if you have 3 master nodes (node1, node2, and node3), then `gearpump.cluster.masters` can be set as - - :::bash - gearpump.cluster { - masters = ["node1:3000", "node2:3000", "node3:3000"] - } - - -#### Configure distributed storage to store application jars. -In `conf/gear.conf`, for the entry `gearpump.jarstore.rootpath`, please choose the storage folder for application jars. You need to make sure this jar storage is highly available. The following storage options are supported: - - 1). HDFS - - You need to configure the `gearpump.jarstore.rootpath` like this - - :::bash - hdfs://host:port/path/ - - - For HDFS HA, - - :::bash - hdfs://namespace/path/ - - - 2). Shared NFS folder - - First you need to map the NFS directory to a local directory (same path) on all master node machines. -Then you need to set the `gearpump.jarstore.rootpath` like this: - - :::bash - file:///your_nfs_mapping_directory - - - 3). If you don't set this value, the local directory of the master node will be used. - NOTE! There is no HA guarantee in this case, which means we are unable to recover running applications when the master goes down. - -### 2. Start Daemon. - -On node1, node2 and node3, start the master - - :::bash - ## on node1 - bin/master -ip node1 -port 3000 - - ## on node2 - bin/master -ip node2 -port 3000 - - ## on node3 - bin/master -ip node3 -port 3000 - - -### 3. Done! - -Now you have a highly available HA cluster. You can kill any master node and HA will take effect. - -**NOTE**: It can take up to 15 seconds for a master node to fail over. 
You can change the fail-over timeout by setting `gearpump-master.akka.cluster.auto-down-unreachable-after` in `gear.conf` to a smaller value, e.g. `gearpump-master.akka.cluster.auto-down-unreachable-after=10s` http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/deployment/deployment-local.md ---------------------------------------------------------------------- diff --git a/docs/docs/deployment/deployment-local.md b/docs/docs/deployment/deployment-local.md deleted file mode 100644 index 81e6029..0000000 --- a/docs/docs/deployment/deployment-local.md +++ /dev/null @@ -1,34 +0,0 @@ -You can start the Gearpump service in a single JVM (local mode), or in a distributed cluster (cluster mode). To start the cluster in local mode, you can use the local/local.bat helper scripts; they are very useful for developing or troubleshooting. - -Below are the steps to start a Gearpump service in **Local** mode: - -### Step 1: Get your Gearpump binary ready -To get your Gearpump service running in local mode, you first need to have a Gearpump distribution binary ready. -Please follow [this guide](get-gearpump-distribution) to get the binary. - -### Step 2: Start the cluster -You can start a local mode cluster in a single line - - :::bash - ## start the master and 2 workers in a single JVM. The master will listen on 3000 - ## you can Ctrl+C to kill the local cluster after you finish the startup tutorial. - bin/local - - -**NOTE:** You may need to execute `chmod +x bin/*` in shell to make the script file `local` executable. - -**NOTE:** You can change the default port by changing the config `gearpump.cluster.masters` in `conf/gear.conf`. - -**NOTE: Change the working directory**. Log files are by default generated under the current working directory. So, please "cd" to the required working directory before running the shell commands. - -**NOTE: Run as Daemon**. You can run it as a background process. For example, use [nohup](http://linux.die.net/man/1/nohup) on Linux. 
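As a sketch, the run-as-daemon approach above might look like the following (the paths are assumptions, and `sleep 300` stands in for the actual `bin/local` script so the commands can be tried anywhere):

    :::bash
    # Run a long-lived process (stand-in for `bin/local`) in the background,
    # immune to hangups, and record its PID so it can be stopped later.
    nohup sleep 300 > /tmp/local-cluster.out 2>&1 &
    echo $! > /tmp/local-cluster.pid

    # Later, stop the daemonized process using the recorded PID.
    kill "$(cat /tmp/local-cluster.pid)"
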
- -### Step 3: Start the Web UI server -Open another shell, - - :::bash - bin/services - -You can manage the applications in the UI at [http://127.0.0.1:8090](http://127.0.0.1:8090) or by the [Command Line tool](../introduction/commandline). -The default username and password is "admin:admin"; you can check -[UI Authentication](../deployment/deployment-ui-authentication) to learn how to manage users. http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/deployment/deployment-msg-delivery.md ---------------------------------------------------------------------- diff --git a/docs/docs/deployment/deployment-msg-delivery.md b/docs/docs/deployment/deployment-msg-delivery.md deleted file mode 100644 index 53e8c2e..0000000 --- a/docs/docs/deployment/deployment-msg-delivery.md +++ /dev/null @@ -1,60 +0,0 @@ -## How to deploy for At Least Once Message Delivery? - -As introduced in [What is At Least Once Message Delivery](../introduction/message-delivery#what-is-at-least-once-message-delivery), Gearpump has a built-in KafkaSource. To get at least once message delivery, users should deploy a Kafka cluster as the offset store along with the Gearpump cluster. - -Here's an example to deploy a local Kafka cluster. - -1. Download the latest Kafka from the official website and extract it to a local directory (`$KAFKA_HOME`) - -2. Boot up the single-node Zookeeper instance packaged with Kafka. - - :::bash - $KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties - - -3. Start a Kafka broker - - :::bash - $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/kafka.properties - - -4. When creating an offset store for `KafkaSource`, set the zookeeper connect string to `localhost:2181` and the broker list to `localhost:9092` in `KafkaStorageFactory`. 
- - :::scala - val offsetStorageFactory = new KafkaStorageFactory("localhost:2181", "localhost:9092") - val source = new KafkaSource("topic1", "localhost:2181", offsetStorageFactory) - - -## How to deploy for Exactly Once Message Delivery? - -Exactly Once Message Delivery requires both an offset store and a checkpoint store. For the offset store, a Kafka cluster should be deployed as in the previous section. As for the checkpoint store, Gearpump has built-in support for Hadoop file systems, like HDFS. Hence, users should deploy an HDFS cluster alongside the Gearpump cluster. - -Here's an example to deploy a local HDFS cluster. - -1. Download Hadoop 2.6 from the official website and extract it to a local directory (`$HADOOP_HOME`) - -2. Add the following configuration to `$HADOOP_HOME/etc/hadoop/core-site.xml` - - :::xml - <configuration> - <property> - <name>fs.defaultFS</name> - <value>hdfs://localhost:9000</value> - </property> - </configuration> - - -3. Start HDFS - - :::bash - $HADOOP_HOME/sbin/start-dfs.sh - - -4. When creating a `HadoopCheckpointStore`, set the Hadoop configuration as in the `core-site.xml` - - :::scala - val hadoopConfig = new Configuration - hadoopConfig.set("fs.defaultFS", "hdfs://localhost:9000") - val checkpointStoreFactory = new HadoopCheckpointStoreFactory("MessageCount", hadoopConfig, new FileSizeRotation(1000)) - - \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/deployment/deployment-resource-isolation.md ---------------------------------------------------------------------- diff --git a/docs/docs/deployment/deployment-resource-isolation.md b/docs/docs/deployment/deployment-resource-isolation.md deleted file mode 100644 index ee47802..0000000 --- a/docs/docs/deployment/deployment-resource-isolation.md +++ /dev/null @@ -1,112 +0,0 @@ -CGroup (abbreviated from control groups) is a Linux kernel feature to limit, account, and isolate resource usage (CPU, memory, disk I/O, etc.) 
of process groups. In Gearpump, we use cgroup to manage CPU resources. - -## Start CGroup Service - -The CGroup feature is only supported on Linux with kernel version 2.6.18 or later. Please also make sure SELinux is disabled before starting CGroup. - -The following steps are supposed to be executed by the root user. - -1. Check whether `/etc/cgconfig.conf` exists. If it does not, install it via `yum install libcgroup`. - -2. Run the following command to see whether the **cpu** subsystem is already mounted to the file system. - - :::bash - lssubsys -m - - Each subsystem in CGroup will have a corresponding mount path in the local file system. For example, the following output shows that the **cpu** subsystem is mounted to the file path `/sys/fs/cgroup/cpu` - - :::bash - cpu /sys/fs/cgroup/cpu - net_cls /sys/fs/cgroup/net_cls - blkio /sys/fs/cgroup/blkio - perf_event /sys/fs/cgroup/perf_event - - -3. If you want to assign permission to user **gear** to launch Gearpump Workers and applications with resource isolation enabled, you need to check gear's uid and gid in the `/etc/passwd` file; let's take **500** as an example. - -4. Add the following content to `/etc/cgconfig.conf` - - - # The mount point of cpu subsystem. - # If your system has already mounted it, this segment should be eliminated. - mount { - cpu = /cgroup/cpu; - } - - # Here the group name "gearpump" represents a node in CGroup's hierarchy tree. - # When the CGroup service is started, there will be a folder generated under the mount point of cpu subsystem, - # whose name is "gearpump". - - group gearpump { - perm { - task { - uid = 500; - gid = 500; - } - admin { - uid = 500; - gid = 500; - } - } - cpu { - } - } - - - Please note that if the output of step 2 shows that the **cpu** subsystem is already mounted, then the `mount` segment should not be included. - -5. Then start the cgroup service - - :::bash - sudo service cgconfig restart - - -6. 
There should be a folder **gearpump** generated under the mount point of the cpu subsystem, and its owner should be **gear:gear**. - -7. Repeat the above-mentioned steps on each machine where you want to launch Gearpump. - -## Enable Cgroups in Gearpump -1. Log in, as user **gear**, to the machine that has CGroup prepared. - - :::bash - ssh gear@node - - -2. Enter Gearpump's home folder and edit gear.conf under the folder `${GEARPUMP_HOME}/conf/` - - :::bash - gearpump.worker.executor-process-launcher = "org.apache.gearpump.cluster.worker.CGroupProcessLauncher" - - gearpump.cgroup.root = "gearpump" - - - Please note that the gearpump.cgroup.root value **gearpump** must be consistent with the group name in /etc/cgconfig.conf. - -3. Repeat the above-mentioned steps on each machine where you want to launch Gearpump - -4. Start the Gearpump cluster; please refer to [Deploy Gearpump in Standalone Mode](deployment-standalone) - -## Launch Application From Command Line -1. Log in to the machine that has the Gearpump distribution. - -2. Enter Gearpump's home folder and edit gear.conf under the folder `${GEARPUMP_HOME}/conf/` - - :::bash - gearpump.cgroup.cpu-core-limit-per-executor = ${your_preferred_int_num} - - - This configures the number of CPU cores each executor can use; -1 means no limitation - -3. Submit the application - - :::bash - bin/gear app -jar examples/sol-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}-assembly.jar -streamProducer 10 -streamProcessor 10 - - -4. Then you can run the `top` command to monitor CPU usage. - -## Launch Application From Dashboard -If you want to submit the application from the dashboard, by default the `gearpump.cgroup.cpu-core-limit-per-executor` is inherited from the Worker's configuration. You can provide your own conf file to override it. - -## Limitations -Windows and Mac OS X don't support CGroup, so resource isolation will not work even if you turn it on; there will be no limit on a single executor's CPU usage. 
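To double-check the setup described above, a small verification script could locate the cpu subsystem mount point and confirm the `gearpump` group directory exists. This is a hypothetical sketch assuming a cgroup v1 layout; the mount point depends on your distribution:

    :::bash
    # Find where the cgroup cpu subsystem is mounted (cgroup v1 assumed),
    # then check whether the "gearpump" group directory was created there.
    CPU_MOUNT=$(awk '$3 == "cgroup" && $4 ~ /cpu/ {print $2; exit}' /proc/mounts)
    if [ -n "$CPU_MOUNT" ] && [ -d "$CPU_MOUNT/gearpump" ]; then
        echo "gearpump cgroup present at $CPU_MOUNT/gearpump"
    else
        echo "gearpump cgroup not found -- check /etc/cgconfig.conf and the cgconfig service"
    fi
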
http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/deployment/deployment-security.md ---------------------------------------------------------------------- diff --git a/docs/docs/deployment/deployment-security.md b/docs/docs/deployment/deployment-security.md deleted file mode 100644 index e20fc67..0000000 --- a/docs/docs/deployment/deployment-security.md +++ /dev/null @@ -1,80 +0,0 @@ -So far, Gearpump supports deployment in a secured Yarn cluster and writing to secured HBase, where "secured" means Kerberos enabled. -Further security-related features are in progress. - -## How to launch Gearpump in a secured Yarn cluster -Suppose user `gear` will launch gearpump on YARN; then the corresponding principal `gear` should be created on the KDC server. - -1. Create a Kerberos principal for user `gear`, on the KDC machine - - :::bash - sudo kadmin.local - - In the kadmin.local or kadmin shell, create the principal - - :::bash - kadmin: addprinc gear/[email protected] - - Remember that user `gear` must exist on every node of Yarn. - -2. Upload the gearpump-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}.zip to a remote HDFS folder; we suggest putting it under `/usr/lib/gearpump/gearpump-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}.zip` - -3. Create the HDFS folder /user/gear/, and make sure all read-write rights are granted for user `gear` - - :::bash - drwxr-xr-x - gear gear 0 2015-11-27 14:03 /user/gear - - -4. Put the YARN configurations under the classpath. - Before calling `yarnclient launch`, make sure you have put all yarn configuration files under the classpath. Typically, you can just copy all files under `$HADOOP_HOME/etc/hadoop` from one of the YARN cluster machines to `conf/yarnconf` of gearpump. `$HADOOP_HOME` points to the Hadoop installation directory. - -5. Get Kerberos credentials to submit the job: - - :::bash - kinit gearpump/[email protected] - - - Here you can log in with a keytab or password. Please refer to the Kerberos documentation for details. 
- - :::bash - yarnclient launch -package /usr/lib/gearpump/gearpump-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}.zip - - -## How to write to secured HBase -When the remote HBase is security enabled, a Kerberos keytab and the corresponding principal name need to be -provided for the gearpump-hbase connector. Specifically, the `UserConfig` object passed into the HBaseSink should contain -`{("gearpump.keytab.file", "\\$keytab"), ("gearpump.kerberos.principal", "\\$principal")}`. Example code for writing to secured HBase: - - :::scala - val principal = "gearpump/[email protected]" - val keytabContent = Files.toByteArray(new File("path_to_keytab_file")) - val appConfig = UserConfig.empty - .withString("gearpump.kerberos.principal", principal) - .withBytes("gearpump.keytab.file", keytabContent) - val sink = new HBaseSink(appConfig, "$tableName") - val sinkProcessor = DataSinkProcessor(sink, "$sinkNum") - val split = Processor[Split]("$splitNum") - val computation = split ~> sinkProcessor - val application = StreamApplication("HBase", Graph(computation), UserConfig.empty) - - -Note that the keytab file set into the config should be a byte array. - -## Future Plan - -### More external components support -1. HDFS -2. Kafka - -### Authentication (Kerberos) -Since Gearpump's Master-Worker structure is similar to HDFS's NameNode-DataNode and Yarn's ResourceManager-NodeManager, we may follow their approach. - -1. User creates a Kerberos principal and keytab for Gearpump. -2. Deploy the keytab files to all the cluster nodes. -3. Configure Gearpump's conf file, specifying the Kerberos principal and local keytab file location. -4. Start Master and Worker. - -Every application has a submitter/user. We will separate applications from different users, e.g. with different log folders for different applications. -Only authenticated users can submit applications to Gearpump's Master. 
- -### Authorization -Hopefully more on this soon http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/deployment/deployment-standalone.md ---------------------------------------------------------------------- diff --git a/docs/docs/deployment/deployment-standalone.md b/docs/docs/deployment/deployment-standalone.md deleted file mode 100644 index c9d5549..0000000 --- a/docs/docs/deployment/deployment-standalone.md +++ /dev/null @@ -1,59 +0,0 @@ -Standalone mode is a distributed cluster mode. That is, Gearpump runs as a service without help from other services (e.g. YARN). - -To deploy Gearpump in cluster mode, please first check that the [Pre-requisites](hardware-requirement) are met. - -### How to Install -You need to have the Gearpump binary at hand. Please refer to [How to get gearpump distribution](get-gearpump-distribution) to get the Gearpump binary. - -It is suggested that you unzip the package to the same directory path on every machine on which you plan to install Gearpump. -To install Gearpump, you at least need to change the configuration in `conf/gear.conf`. - -Config | Default value | Description ------------- | ---------------|------------ -gearpump.hostname | "127.0.0.1" | Host or IP address of the current machine. The ip/host needs to be reachable from other machines in the cluster. -gearpump.cluster.masters | ["127.0.0.1:3000"] | List of all master nodes, with each item representing the host and port of one master. -gearpump.worker.slots | 1000 | how many slots this worker has - -Besides this, there are other optional configurations related to logs, metrics, transports, and UI. You can refer to the [Configuration Guide](deployment-configuration) for more details. - -### Start the Cluster Daemons in Standalone mode -In Standalone mode, you can start the master and workers in different JVMs. 
- -##### To start master: - - :::bash - bin/master -ip xx -port xx - -The ip and port will be checked against the settings under `conf/gear.conf`, so you need to make sure they are consistent. - -**NOTE:** You may need to execute `chmod +x bin/*` in shell to make the script file `master` executable. - -**NOTE**: for high availability, please check the [Master HA Guide](deployment-ha) - -##### To start worker: - - :::bash - bin/worker - -### Start UI - - :::bash - bin/services - - -After the UI is started, you can browse to `http://{web_ui_host}:8090` to view the cluster status. -The default username and password is "admin:admin"; you can check -[UI Authentication](deployment-ui-authentication) to learn how to manage users. - - - -**NOTE:** The UI port can be configured in `gear.conf`. Check the [Configuration Guide](deployment-configuration) for information. - -### Bash tool to start cluster - -There is a bash tool `bin/start-cluster.sh` that can launch the cluster conveniently. You need to change the files `conf/masters`, `conf/workers` and `conf/dashboard` to specify the corresponding machines. -Before running the bash tool, please make sure the Gearpump package is already unzipped to the same directory path on every machine. -`bin/stop-cluster.sh` stops the whole cluster. - -The bash tool is able to launch the cluster without changing the `conf/gear.conf` on every machine. The script sets `gearpump.cluster.masters` and other configurations using JAVA_OPTS. -However, please note that when you log into any of these unconfigured machines and try to launch the dashboard or submit an application, you still need to modify `conf/gear.conf` manually because JAVA_OPTS is missing. 
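As an illustration, the machine-list files read by `bin/start-cluster.sh` could be prepared like this. The hostnames `node1`-`node3` and the one-machine-per-line format are assumptions; check the comments shipped in those files for the exact expected format:

    :::bash
    # Sketch: write the host-list files consumed by start-cluster.sh.
    # node1..node3 are placeholders -- substitute your own machines.
    CONF_DIR=/tmp/gearpump-conf        # in a real setup this is ${GEARPUMP_HOME}/conf
    mkdir -p "$CONF_DIR"
    printf 'node1\n'        > "$CONF_DIR/masters"
    printf 'node2\nnode3\n' > "$CONF_DIR/workers"
    printf 'node1\n'        > "$CONF_DIR/dashboard"
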
http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/deployment/deployment-ui-authentication.md ---------------------------------------------------------------------- diff --git a/docs/docs/deployment/deployment-ui-authentication.md b/docs/docs/deployment/deployment-ui-authentication.md deleted file mode 100644 index 0990192..0000000 --- a/docs/docs/deployment/deployment-ui-authentication.md +++ /dev/null @@ -1,290 +0,0 @@ -## What is this about? - -## How to enable UI authentication? - -1. In the config file gear.conf, find the entry `gearpump-ui.gearpump.ui-security.authentication-enabled` and change the value to true - - :::bash - gearpump-ui.gearpump.ui-security.authentication-enabled = true - - - Restart the UI dashboard, and then UI authentication is enabled. It will prompt for a user name and password. - -## How many authentication methods does the Gearpump UI server support? - -Currently, it supports: - -1. Username-Password based authentication and -2. OAuth2 based authentication. - -User-Password based authentication is enabled whenever `gearpump-ui.gearpump.ui-security.authentication-enabled` is true, - and **CANNOT** be disabled. - -The UI server admin can also choose to enable an **auxiliary** OAuth2 authentication channel. - -## User-Password based authentication - - User-Password based authentication covers all authentication scenarios that require - the user to enter an explicit username and password. - - Gearpump provides a built-in ConfigFileBasedAuthenticator, which verifies the user name and password - against password hashcodes stored in config files. - - However, developers can extend `org.apache.gearpump.security.Authenticator` to provide a custom - User-Password based authenticator, to support LDAP, Kerberos, and Database-based authentication... - -### ConfigFileBasedAuthenticator: built-in User-Password Authenticator - -ConfigFileBasedAuthenticator stores all user names and password hashcodes in the configuration file gear.conf. Here
Here -is the steps to configure ConfigFileBasedAuthenticator. - -#### How to add or remove user? - -For the default authentication plugin, it has three categories of users: admins, users, and guests. - -* admins: have unlimited permission, like shutdown a cluster, add/remove machines. -* users: have limited permission to submit an application and etc.. -* guests: can not submit/kill applications, but can view the application status. - -System administrator can add or remove user by updating config file `conf/gear.conf`. - -Suppose we want to add user jerry as an administrator, here are the steps: - -1. Pick a password, and generate the digest for this password. Suppose we use password `ilovegearpump`, - to generate the digest: - - :::bash - bin/gear org.apache.gearpump.security.PasswordUtil -password ilovegearpump - - - It will generate a digest value like this: - - :::bash - CgGxGOxlU8ggNdOXejCeLxy+isrCv0TrS37HwA== - - -2. Change config file conf/gear.conf at path `gearpump-ui.gearpump.ui-security.config-file-based-authenticator.admins`, - add user `jerry` in this list: - - :::bash - admins = { - ## Default Admin. Username: admin, password: admin - ## !!! Please replace this builtin account for production cluster for security reason. !!! - "admin" = "AeGxGOxlU8QENdOXejCeLxy+isrCv0TrS37HwA==" - "jerry" = "CgGxGOxlU8ggNdOXejCeLxy+isrCv0TrS37HwA==" - } - - -3. Restart the UI dashboard by `bin/services` to make the change effective. - -4. Group "admins" have very unlimited permission, you may want to restrict the permission. In that case - you can modify `gearpump-ui.gearpump.ui-security.config-file-based-authenticator.users` or - `gearpump-ui.gearpump.ui-security.config-file-based-authenticator.guests`. - -5. See description at `conf/gear.conf` to find more information. - -#### What is the default user and password? - -For ConfigFileBasedAuthenticator, Gearpump distribution is shipped with two default users: - -1. username: admin, password: admin -2. 
username: guest, password: guest - -User `admin` has unlimited permissions, while `guest` can only view the application status. - -For security reasons, you should remove the default users `admin` and `guest` on production clusters. - -#### Is this secure? - -First, we do NOT store any user password in any form, so only the user knows the password. -A one-way hash digest is used to verify the password the user enters. - -### How to develop a custom User-Password Authenticator for LDAP, Database, etc. - -If a developer chooses to define their own User-Password based authenticator, they must modify the configuration option: - - :::bash - ## Replace "org.apache.gearpump.security.CustomAuthenticator" with your real authenticator class. - gearpump.ui-security.authenticator = "org.apache.gearpump.security.CustomAuthenticator" - - -Make sure CustomAuthenticator extends the interface: - - :::scala - trait Authenticator { - - def authenticate(user: String, password: String, ec: ExecutionContext): Future[AuthenticationResult] - } - - -## OAuth2 based authentication - -OAuth2 based authentication is commonly used to achieve social login with a social network account. - -Gearpump provides generic OAuth2 authentication support which allows users to extend it to support new authentication sources. - -Basically, OAuth2 based authentication consists of these steps: - 1. The user accesses the Gearpump UI website, and chooses to log in with an OAuth2 server. - 2. The Gearpump UI website redirects the user to the OAuth2 server's authorization endpoint. - 3. The end user completes the authorization in the domain of the OAuth2 server. - 4. The OAuth2 server redirects the user back to the Gearpump UI server. - 5. The Gearpump UI server verifies the tokens and extracts credentials from query - parameters and form fields. 
- -### Terminologies - -For terms like client ID and client secret, please refer to [RFC 6749](https://tools.ietf.org/html/rfc6749) - -### Enable web proxy for UI server - -To enable OAuth2 authentication, the Gearpump UI server should have network access to the OAuth2 server, as - some requests are initiated directly inside the Gearpump UI server. So, if you are behind a firewall, make - sure you have configured the proxy properly for the UI server. - -#### If you are on Windows - - :::bash - set JAVA_OPTS=-Dhttp.proxyHost=xx.com -Dhttp.proxyPort=8088 -Dhttps.proxyHost=xx.com -Dhttps.proxyPort=8088 - bin/services - - -#### If you are on Linux - - :::bash - export JAVA_OPTS="-Dhttp.proxyHost=xx.com -Dhttp.proxyPort=8088 -Dhttps.proxyHost=xx.com -Dhttps.proxyPort=8088" - bin/services - - -### Google Plus OAuth2 Authenticator - -The Google Plus OAuth2 Authenticator does authentication with the Google OAuth2 service. It extracts the email address -from the Google user profile as credentials. - -To use the Google OAuth2 Authenticator, there are several steps: - -1. Register your application (the Gearpump UI server here) as an application in the Google developer console. -2. Configure the Google OAuth2 information in gear.conf -3. Configure the network proxy for the Gearpump UI server if applicable. - -#### Step1: Register your website as an OAuth2 Application on Google - -1. Create an application representing your website at [https://console.developers.google.com](https://console.developers.google.com) -2. In the "API Manager" of your created application, enable the API "Google+ API" -3. Create an OAuth client ID for this application. In the "Credentials" tab of "API Manager", -choose "Create credentials", and then select OAuth client ID. Follow the wizard -to set the callback URL and generate the client ID and client secret. - -**NOTE:** The callback URL is NOT optional. - -#### Step2: Configure the OAuth2 information in gear.conf - -1. 
Enable OAuth2 authentication by setting `gearpump.ui-security.oauth2-authenticator-enabled` -to true. -2. Configure the section `gearpump.ui-security.oauth2-authenticators.google` in gear.conf. Please make sure -the class name, client ID, client secret, and callback URL are set properly. - -**NOTE:** The callback URL set here should match what is configured on Google in step 1. - -#### Step3: Configure the network proxy if applicable. - -To enable OAuth2 authentication, the Gearpump UI server should have network access to the Google service, as - some requests are initiated directly inside the Gearpump UI server. So, if you are behind a firewall, make - sure you have configured the proxy properly for the UI server. - -For a guide on how to configure the web proxy for the UI server, please refer to the section "Enable web proxy for UI server" above. - -#### Step4: Restart the UI server and try to click the Google login icon on the UI server. - -### CloudFoundry UAA server OAuth2 Authenticator - -CloudFoundryUaaAuthenticator does authentication by using the CloudFoundry UAA OAuth2 service. It extracts the email address - from the user profile as credentials. - -For what UAA (User Account and Authentication Service) is, please see the guide: [UAA](https://github.com/cloudfoundry/uaa) - -To use the CloudFoundry UAA OAuth2 Authenticator, there are several steps: - -1. Register your application (the Gearpump UI server here) as an application to UAA with the helper tool `uaac`. -2. Configure the OAuth2 information in gear.conf -3. Configure the network proxy for the Gearpump UI server if applicable. - -#### Step1: Register your application to UAA with `uaac` - -1. Check the tutorial on uaac at [https://docs.cloudfoundry.org/adminguide/uaa-user-management.html](https://docs.cloudfoundry.org/adminguide/uaa-user-management.html) -2. Open a bash shell, and set the UAA server by the command `uaac target` - - :::bash - uaac target [your uaa server url] - - -3. Log in as user admin by - - :::bash - uaac token client get admin -s MyAdminPassword - - -4. 
Create a new Application (Client) in UAA, - - :::bash - uaac client add [your_client_id] - --scope "openid cloud_controller.read" - --authorized_grant_types "authorization_code client_credentials refresh_token" - --authorities "openid cloud_controller.read" - --redirect_uri [your_redirect_url] - --autoapprove true - --secret [your_client_secret] - - -#### Step2: Configure the OAuth2 information in gear.conf - -1. Enable OAuth2 authentication by setting `gearpump.ui-security.oauth2-authenticator-enabled` to true. -2. Navigate to the section `gearpump.ui-security.oauth2-authenticators.cloudfoundryuaa` -3. Configure the `gearpump.ui-security.oauth2-authenticators.cloudfoundryuaa` section in gear.conf. -Please make sure the class name, client ID, client secret, and callback URL are set properly. - -**NOTE:** The callback URL here should match what you set on CloudFoundry UAA in step 1. - -#### Step3: Configure the network proxy for the Gearpump UI server if applicable - -To enable OAuth2 authentication, the Gearpump UI server should have network access to the OAuth2 service, as - some requests are initiated directly inside the Gearpump UI server. So, if you are behind a firewall, make - sure you have configured the proxy properly for the UI server. - -For a guide on how to configure the web proxy for the UI server, please refer to the section "Enable web proxy for UI server" above. - -#### Step4: Restart the UI server and try to click the CloudFoundry login icon on the UI server. - -#### Step5: You can also enable an additional authenticator for CloudFoundry UAA by setting the config: - - :::bash - additional-authenticator-enabled = true - - -Please see the description in gear.conf for more information. - -#### Extending OAuth2Authenticator to support a new authorization service like Facebook or Twitter - -You can follow the Google OAuth2 example code to define a custom OAuth2Authenticator. Basically, the steps include: - -1. Define an OAuth2Authenticator implementation. - -2. 
Add a configuration entry under `gearpump.ui-security.oauth2-authenticators`. For example:

    ## name of this authenticator
    "socialnetworkx" {
      "class" = "org.apache.gearpump.services.security.oauth2.impl.SocialNetworkXAuthenticator"

      ## Please make sure this URL matches the name
      "callback" = "http://127.0.0.1:8090/login/oauth2/socialnetworkx/callback"

      "clientId" = "gearpump_test2"
      "clientSecret" = "gearpump_test2"
      "defaultUserRole" = "guest"

      ## Make sure socialnetworkx.png exists under dashboard/icons
      "icon" = "/icons/socialnetworkx.png"
    }

The configuration entry is supposed to be used by the class `SocialNetworkXAuthenticator`.
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/deployment/deployment-yarn.md ---------------------------------------------------------------------- diff --git a/docs/docs/deployment/deployment-yarn.md b/docs/docs/deployment/deployment-yarn.md deleted file mode 100644 index 401fa46..0000000 --- a/docs/docs/deployment/deployment-yarn.md +++ /dev/null @@ -1,135 +0,0 @@

## How to launch a Gearpump cluster on YARN

1. Upload the `gearpump-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}.zip` to a remote HDFS folder; we suggest putting it under `/usr/lib/gearpump/gearpump-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}.zip`

2. Make sure the home directory on HDFS is already created and all read-write rights are granted for the user. For example, user gear's home directory is `/user/gear`

3. Put the YARN configurations under the classpath. Before calling `yarnclient launch`, make sure you have put all YARN configuration files under the classpath. Typically, you can just copy all files under `$HADOOP_HOME/etc/hadoop` from one of the YARN cluster machines to `conf/yarnconf` of Gearpump. `$HADOOP_HOME` points to the Hadoop installation directory.

4. 
Launch the Gearpump cluster on YARN

        :::bash
        yarnclient launch -package /usr/lib/gearpump/gearpump-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}.zip

    If you don't specify the package path, it will read the default package path (`gearpump.yarn.client.package-path`) from `gear.conf`.

    **NOTE:** You may need to execute `chmod +x bin/*` in a shell to make the script file `yarnclient` executable.

5. After launching, you can browse the Gearpump UI via the YARN resource manager dashboard.

## How to configure the resource limitation of a Gearpump cluster

Before launching a Gearpump cluster, please change the configuration section `gearpump.yarn` in `gear.conf` to configure the resource limitation, like:

1. The number of worker containers.
2. The YARN container memory size for worker and master.

## How to submit an application to a Gearpump cluster

To submit a jar to the Gearpump cluster, we first need to know the Master address, so we need to get an active configuration file first.

There are two ways to get an active configuration file:

1. Option 1: specify the "-output" option when you launch the cluster.

        :::bash
        yarnclient launch -package /usr/lib/gearpump/gearpump-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}.zip -output /tmp/mycluster.conf

    It will print to the console something like:

        :::bash
        ==Application Id: application_1449802454214_0034

2. Option 2: Query the active configuration file.

        :::bash
        yarnclient getconfig -appid <yarn application id> -output /tmp/mycluster.conf

    The YARN application id can be found in the output of Option 1 or on the YARN dashboard.

3. After you have downloaded the configuration file, you can launch an application with that config file.

        :::bash
        gear app -jar examples/wordcount-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}.jar -conf /tmp/mycluster.conf

4. 
To run a Storm application over Gearpump on YARN, please store the configuration file with `-output application.conf` and then launch the Storm application with

        :::bash
        storm -jar examples/storm-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}.jar storm.starter.ExclamationTopology exclamation

5. Now the application is running. To check this:

        :::bash
        gear info -conf /tmp/mycluster.conf

6. To start a UI server, please do:

        :::bash
        services -conf /tmp/mycluster.conf

    The default username and password are "admin:admin"; you can check [UI Authentication](deployment-ui-authentication) to find out how to manage users.

## How to add/remove machines dynamically

The Gearpump YARN tool allows you to dynamically add and remove machines. Here are the steps:

1. First, query to get the active resources.

        :::bash
        yarnclient query -appid <yarn application id>

    The console output will show how many workers and masters there are. For example, I have output like this:

        :::bash
        masters:
        container_1449802454214_0034_01_000002(IDHV22-01:35712)
        workers:
        container_1449802454214_0034_01_000003(IDHV22-01:35712)
        container_1449802454214_0034_01_000006(IDHV22-01:35712)

2. To add a new worker machine, you can do:

        :::bash
        yarnclient addworker -appid <yarn application id> -count 2

    This will add two new worker machines. Run the command in the first step to check whether the change is effective.

3. To remove old machines, use:

        :::bash
        yarnclient removeworker -appid <yarn application id> -container <worker container id>

    The worker container id can be found in the output of step 1. For example, "container_1449802454214_0034_01_000006" is a valid container id.

## Other usage

1. To kill a cluster,

        :::bash
        yarnclient kill -appid <yarn application id>

    **NOTE:** If the application was not launched successfully, this command won't work. Please use "yarn application -kill <appId>" instead.

2. 
To check the Gearpump version

        :::bash
        yarnclient version -appid <yarn application id>

http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/deployment/get-gearpump-distribution.md ---------------------------------------------------------------------- diff --git a/docs/docs/deployment/get-gearpump-distribution.md b/docs/docs/deployment/get-gearpump-distribution.md deleted file mode 100644 index 8a31113..0000000 --- a/docs/docs/deployment/get-gearpump-distribution.md +++ /dev/null @@ -1,83 +0,0 @@

### Prepare the binary

You can either download a pre-built release package or choose to build from source code.

#### Download Release Binary

If you choose to use a pre-built package, then you don't need to build from source code. The release package can be downloaded from:

##### [Download page](http://gearpump.incubator.apache.org/downloads.html)

#### Build from Source code

If you choose to build the package from source code yourself, you can follow these steps:

1). Clone the Gearpump repository

    :::bash
    git clone https://github.com/apache/incubator-gearpump.git
    cd incubator-gearpump

2). Build the package

    :::bash
    ## Please use Scala 2.11
    ## The target package path: output/target/gearpump-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}.zip
    sbt clean assembly packArchiveZip

After the build, there will be a package file gearpump-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}.zip generated under the output/target/ folder.

**NOTE:** Please set the JAVA_HOME environment variable before the build.

On Linux:

    :::bash
    export JAVA_HOME={path/to/jdk/root/path}

On Windows:

    :::bash
    set JAVA_HOME={path/to/jdk/root/path}

**NOTE:** The build requires a network connection. If you are behind an enterprise proxy, make sure you have set the proxy in your environment before running the build commands.
For Windows:

    :::bash
    set HTTP_PROXY=http://host:port
    set HTTPS_PROXY=http://host:port

For Linux:

    :::bash
    export HTTP_PROXY=http://host:port
    export HTTPS_PROXY=http://host:port

### Gearpump package structure

You need to extract the `.zip` file to use it. On Linux, you can

    :::bash
    unzip gearpump-{{SCALA_BINARY_VERSION}}-{{GEARPUMP_VERSION}}.zip

After decompression, the directory structure looks like picture 1.

Under the bin/ folder, there are script files for Linux (bash scripts) and Windows (.bat scripts).

script | function
--------|------------
local | You can start the Gearpump cluster in a single JVM (local mode), or in a distributed cluster (cluster mode). To start the cluster in local mode, you can use the local/local.bat helper script; it is very useful for development or troubleshooting.
master | To start Gearpump in cluster mode, you need to start one or more master nodes, which represent the global resource management center. master/master.bat is the launcher script to boot a master node.
worker | To start Gearpump in cluster mode, you also need to start several workers, with each worker representing a set of local resources. worker/worker.bat is the launcher script to start a worker node.
services | This script is used to start the backend REST service and other services for the frontend UI dashboard (default user "admin:admin").

Please check [Command Line Syntax](../introduction/commandline) for more information about each script.
http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/deployment/hardware-requirement.md ---------------------------------------------------------------------- diff --git a/docs/docs/deployment/hardware-requirement.md b/docs/docs/deployment/hardware-requirement.md deleted file mode 100644 index dfe765b..0000000 --- a/docs/docs/deployment/hardware-requirement.md +++ /dev/null @@ -1,30 +0,0 @@

### Pre-requisite

A Gearpump cluster can be installed on Windows and Linux.

Before installation, you need to decide how many machines will be used to run this cluster.

For each machine, the requirements are listed in the table below.

**Table: Environment requirement on a single machine**

Resource | Requirements
------------ | ---------------------------
Memory | 2GB free memory is required to run the cluster. For any production system, 32GB memory is recommended.
Java | JRE 6 or above
User permission | Root permission is not required
Network | Ethernet (TCP/IP)
CPU | Nothing special
HDFS installation | Not required by default. You only need to install it when you want to store the application jars in HDFS.
Kafka installation | Not required by default. You need to install Kafka when you want the at-least-once message delivery feature. Currently, the only supported data source for this feature is Kafka.

**Table: The default ports used in Gearpump:**

Usage | Port | Description
------------- | ---------------|------------
Dashboard UI | 8090 | Web UI.
Dashboard web socket service | 8091 | UI backend web socket service for long connections.
Master port | 3000 | Every other role, like worker, appmaster, executor, and user, uses this port to communicate with the Master.

You need to ensure that your firewall has not blocked these ports so that Gearpump can work correctly. You can also modify the port configuration; check the [Configuration](deployment-configuration) section for details.
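Since the ports above must be reachable between nodes, a quick connectivity check can save debugging time before a deployment. Below is a minimal, self-contained sketch (not part of Gearpump itself; the host name and the port list mirroring the defaults in the table are assumptions to adapt to your setup):

    :::scala
    import java.net.{InetSocketAddress, Socket}

    object PortCheck {
      // Default Gearpump ports from the table above: UI, web socket, Master.
      val defaultPorts = Seq(8090, 8091, 3000)

      // Returns true if a TCP connection to host:port succeeds within timeoutMs.
      def reachable(host: String, port: Int, timeoutMs: Int = 1000): Boolean = {
        val socket = new Socket()
        try {
          socket.connect(new InetSocketAddress(host, port), timeoutMs)
          true
        } catch {
          case _: Exception => false
        } finally {
          socket.close()
        }
      }

      def main(args: Array[String]): Unit = {
        val host = if (args.nonEmpty) args(0) else "127.0.0.1"
        defaultPorts.foreach { port =>
          println(s"$host:$port reachable = ${reachable(host, port)}")
        }
      }
    }

Run it against each cluster machine; a `false` for the Master port usually points at a firewall rule rather than a Gearpump misconfiguration.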
http://git-wip-us.apache.org/repos/asf/incubator-gearpump/blob/5f90b70f/docs/docs/dev/dev-connectors.md ---------------------------------------------------------------------- diff --git a/docs/docs/dev/dev-connectors.md b/docs/docs/dev/dev-connectors.md deleted file mode 100644 index 01deb3e..0000000 --- a/docs/docs/dev/dev-connectors.md +++ /dev/null @@ -1,237 +0,0 @@

## Basic Concepts

`DataSource` and `DataSink` are the two main concepts Gearpump uses to connect with the outside world.

### DataSource

`DataSource` is the start point of a streaming processing flow.

### DataSink

`DataSink` is the end point of a streaming processing flow.

## Implemented Connectors

### `DataSource` implemented

Currently, we have the following `DataSource` implementations supported.

Name | Description
-----| ----------
`CollectionDataSource` | Converts a collection to a recurring data source. E.g. `seq(1, 2, 3)` will output `1,2,3,1,2,3...`.
`KafkaSource` | Reads from Kafka.

### `DataSink` implemented

Currently, we have the following `DataSink` implementations supported.

Name | Description
-----| ----------
`HBaseSink` | Writes the message to HBase. The message to write must be an HBase `Put` or a tuple of `(rowKey, family, column, value)`.
`KafkaSink` | Writes to Kafka.

## Use of Connectors

### Use of Kafka connectors

To use the Kafka connectors in your application, you first need to add the `gearpump-external-kafka` library dependency to your application:

#### SBT

    :::sbt
    "org.apache.gearpump" %% "gearpump-external-kafka" % {{GEARPUMP_VERSION}}

#### XML

    :::xml
    <dependency>
      <groupId>org.apache.gearpump</groupId>
      <artifactId>gearpump-external-kafka</artifactId>
      <version>{{GEARPUMP_VERSION}}</version>
    </dependency>

This is a simple example to read from Kafka and write it back using `KafkaSource` and `KafkaSink`. Users can optionally set a `CheckpointStoreFactory` such that Kafka offsets are checkpointed and at-least-once message delivery is guaranteed.
#### Low level API

    :::scala
    val appConfig = UserConfig.empty
    val props = new Properties
    props.put(KafkaConfig.ZOOKEEPER_CONNECT_CONFIG, zookeeperConnect)
    props.put(KafkaConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList)
    props.put(KafkaConfig.CHECKPOINT_STORE_NAME_PREFIX_CONFIG, appName)
    val source = new KafkaSource(sourceTopic, props)
    val checkpointStoreFactory = new KafkaStoreFactory(props)
    source.setCheckpointStore(checkpointStoreFactory)
    val sourceProcessor = DataSourceProcessor(source, sourceNum)
    val sink = new KafkaSink(sinkTopic, props)
    val sinkProcessor = DataSinkProcessor(sink, sinkNum)
    val partitioner = new ShufflePartitioner
    val computation = sourceProcessor ~ partitioner ~> sinkProcessor
    val app = StreamApplication(appName, Graph(computation), appConfig)

#### High level API

    :::scala
    val props = new Properties
    val appName = "KafkaDSL"
    props.put(KafkaConfig.ZOOKEEPER_CONNECT_CONFIG, zookeeperConnect)
    props.put(KafkaConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList)
    props.put(KafkaConfig.CHECKPOINT_STORE_NAME_PREFIX_CONFIG, appName)

    val app = StreamApp(appName, context)

    if (atLeastOnce) {
      val checkpointStoreFactory = new KafkaStoreFactory(props)
      KafkaDSL.createAtLeastOnceStream(app, sourceTopic, checkpointStoreFactory, props, sourceNum)
        .writeToKafka(sinkTopic, props, sinkNum)
    } else {
      KafkaDSL.createAtMostOnceStream(app, sourceTopic, props, sourceNum)
        .writeToKafka(sinkTopic, props, sinkNum)
    }

In the above example, configurations are set through Java properties and shared by `KafkaSource`, `KafkaSink` and `KafkaCheckpointStoreFactory`. Their configurations can be defined differently, as below.
#### `KafkaSource` configurations

Name | Descriptions | Type | Default
---- | ------------ | ---- | -------
`KafkaConfig.ZOOKEEPER_CONNECT_CONFIG` | Zookeeper connect string for Kafka topics management | String |
`KafkaConfig.CLIENT_ID_CONFIG` | An id string to pass to the server when making requests | String | ""
`KafkaConfig.GROUP_ID_CONFIG` | A string that uniquely identifies a set of consumers within the same consumer group | String | ""
`KafkaConfig.FETCH_SLEEP_MS_CONFIG` | The amount of time (ms) to sleep when hitting fetch.threshold | Int | 100
`KafkaConfig.FETCH_THRESHOLD_CONFIG` | Size of the internal queue to keep Kafka messages. Stop fetching and go to sleep when hitting the threshold | Int | 10000
`KafkaConfig.PARTITION_GROUPER_CLASS_CONFIG` | Partition grouper class to group partitions among source tasks | Class | DefaultPartitionGrouper
`KafkaConfig.MESSAGE_DECODER_CLASS_CONFIG` | Message decoder class to decode raw bytes from Kafka | Class | DefaultMessageDecoder
`KafkaConfig.TIMESTAMP_FILTER_CLASS_CONFIG` | Timestamp filter class to filter out late messages | Class | DefaultTimeStampFilter

#### `KafkaSink` configurations

Name | Descriptions | Type | Default
---- | ------------ | ---- | -------
`KafkaConfig.BOOTSTRAP_SERVERS_CONFIG` | A list of host/port pairs to use for establishing the initial connection to the Kafka cluster | String |
`KafkaConfig.CLIENT_ID_CONFIG` | An id string to pass to the server when making requests | String | ""

#### `KafkaCheckpointStoreFactory` configurations

Name | Descriptions | Type | Default
---- | ------------ | ---- | -------
`KafkaConfig.ZOOKEEPER_CONNECT_CONFIG` | Zookeeper connect string for Kafka topics management | String |
`KafkaConfig.BOOTSTRAP_SERVERS_CONFIG` | A list of host/port pairs to use for establishing the initial connection to the Kafka cluster | String |
`KafkaConfig.CHECKPOINT_STORE_NAME_PREFIX_CONFIG` | Name prefix for the checkpoint store | String | ""
`KafkaConfig.REPLICATION_FACTOR` | Replication factor for the checkpoint store topic | Int | 1

### Use of `HBaseSink`

To use `HBaseSink` in your application, you first need to add the `gearpump-external-hbase` library dependency to your application:

#### SBT

    :::sbt
    "org.apache.gearpump" %% "gearpump-external-hbase" % {{GEARPUMP_VERSION}}

#### XML

    :::xml
    <dependency>
      <groupId>org.apache.gearpump</groupId>
      <artifactId>gearpump-external-hbase</artifactId>
      <version>{{GEARPUMP_VERSION}}</version>
    </dependency>

To connect to HBase, you need to provide the following info:

 * the HBase configuration to tell which HBase service to connect to
 * the table name (you must create the table yourself, see the [HBase documentation](https://hbase.apache.org/book.html))

Then, you can use `HBaseSink` in your application:

    :::scala
    //create the HBase data sink
    val sink = HBaseSink(UserConfig.empty, tableName, HBaseConfiguration.create())

    //create Gearpump Processor
    val sinkProcessor = DataSinkProcessor(sink, parallelism)

    :::scala
    //assume stream is a normal `Stream` in DSL
    stream.writeToHbase(UserConfig.empty, tableName, parallelism, "write to HBase")

You can tune the connection to HBase via the HBase configuration passed in. If it is not passed, Gearpump will search the local classpath for a valid HBase configuration (`hbase-site.xml`).

Note that, due to the issue discussed [here](http://stackoverflow.com/questions/24456484/hbase-managed-zookeeper-suddenly-trying-to-connect-to-localhost-instead-of-zooke), you may need to create an additional configuration for your HBase sink:

    :::scala
    def hadoopConfig = {
      val conf = new Configuration()
      conf.set("hbase.zookeeper.quorum", "zookeeperHost")
      conf.set("hbase.zookeeper.property.clientPort", "2181")
      conf
    }
    val sink = HBaseSink(UserConfig.empty, tableName, hadoopConfig)

## How to implement your own `DataSource`

To implement your own `DataSource`, you need to implement two things:

1. 
The data source itself
2. A helper class to ease its usage in the DSL

### Implement your own `DataSource`

You need to implement a class derived from `org.apache.gearpump.streaming.transaction.api.TimeReplayableSource`.

### Implement DSL helper (Optional)

If you would like to use your `DataSource` in the DSL, it is better to implement your own DSL helper. You can refer to `KafkaDSLUtil` as an example in the Gearpump source.

Below is a code snippet from `KafkaDSLUtil`:

    :::scala
    object KafkaDSLUtil {

      def createStream[T](
          app: StreamApp,
          topics: String,
          parallelism: Int,
          description: String,
          properties: Properties): dsl.Stream[T] = {
        app.source[T](new KafkaSource(topics, properties), parallelism, description)
      }
    }

## How to implement your own `DataSink`

To implement your own `DataSink`, you need to implement two things:

1. The data sink itself
2. A helper class to make it easy to use in the DSL

### Implement your own `DataSink`

You need to implement a class derived from `org.apache.gearpump.streaming.sink.DataSink`.

### Implement DSL helper (Optional)

If you would like to use your `DataSink` in the DSL, it is better to implement your own DSL helper. You can refer to `HBaseDSLSink` as an example in the Gearpump source.
Below is a code snippet from `HBaseDSLSink`:

    :::scala
    class HBaseDSLSink[T](stream: Stream[T]) {
      def writeToHbase(userConfig: UserConfig, table: String, parallism: Int, description: String): Stream[T] = {
        stream.sink(HBaseSink[T](userConfig, table), parallism, userConfig, description)
      }

      def writeToHbase(userConfig: UserConfig, configuration: Configuration, table: String, parallism: Int, description: String): Stream[T] = {
        stream.sink(HBaseSink[T](userConfig, table, configuration), parallism, userConfig, description)
      }
    }

    object HBaseDSLSink {
      implicit def streamToHBaseDSLSink[T](stream: Stream[T]): HBaseDSLSink[T] = {
        new HBaseDSLSink[T](stream)
      }
    }
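The DSL helper pattern above relies on a Scala implicit conversion to graft a connector-specific method onto `Stream[T]`. The same structure can be sketched with self-contained toy types; everything below (`ToyStream`, `ConsoleDSLSink`, `writeToConsole`) is illustrative only and not part of the Gearpump API:

    :::scala
    import scala.language.implicitConversions

    // Toy stand-in for Gearpump's dsl.Stream: it just records sink descriptions.
    case class ToyStream[T](sinks: List[String] = Nil) {
      def sink(description: String): ToyStream[T] = copy(sinks = sinks :+ description)
    }

    // Helper class wrapping a stream and adding a connector-specific method,
    // mirroring the structure of HBaseDSLSink above.
    class ConsoleDSLSink[T](stream: ToyStream[T]) {
      def writeToConsole(description: String): ToyStream[T] =
        stream.sink(s"console: $description")
    }

    object ConsoleDSLSink {
      // The implicit conversion makes writeToConsole available on any ToyStream.
      implicit def streamToConsoleDSLSink[T](stream: ToyStream[T]): ConsoleDSLSink[T] =
        new ConsoleDSLSink[T](stream)
    }

    object Demo {
      import ConsoleDSLSink._
      def run(): List[String] =
        ToyStream[String]().writeToConsole("debug output").sinks
    }

With the implicit in scope, `stream.writeToConsole(...)` reads as if the method were defined on the stream itself, which is exactly how `stream.writeToHbase(...)` works in the snippet above.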
