Github user mattf-horton commented on a diff in the pull request:
https://github.com/apache/metron/pull/614#discussion_r121000712
--- Diff: metron-platform/Performance-tuning-guide.md ---
@@ -0,0 +1,326 @@
+# Metron Performance Tuning Guide
+
+## Overview
+
+This document provides guidance from our experiences tuning the Apache Metron Storm topologies for maximum performance. You'll find
+suggestions for optimum configurations under a 1 Gbps load along with some guidance around the tooling we used to monitor and assess
+our throughput.
+
+In the simplest terms, Metron is a streaming architecture created on top of Kafka and three main types of Storm topologies: parsers,
+enrichment, and indexing. Each parser has its own topology and there is also a highly performant, specialized spout-only topology
+for streaming PCAP data to HDFS. We found that the architecture can be tuned almost exclusively using a few primary Storm and
+Kafka parameters along with a few Metron-specific options. You can think of the data flow as being similar to water flowing through a
+pipe, and the majority of these options assist in tweaking the various pipe widths in the system.
+
+## General Suggestions
+
+Note that there is currently no method for specifying the number of tasks separately from the number of executors in Flux topologies (enrichment,
+indexing). By default, the number of tasks will equal the number of executors. Logically, setting the number of tasks equal to the number
+of executors is sensible. Storm enforces # executors <= # tasks. The reason you might set the number of tasks higher than the number of
+executors is for future performance tuning and rebalancing without the need to bring down your topologies. The number of tasks is fixed
+at topology startup time, whereas the number of executors can be increased up to a maximum value equal to the number of tasks.
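+
+For example, if a topology was launched with more tasks than executors, you can later raise an individual component's executor count (up to its task count) without restarting the topology. The following is only a sketch using Storm's standard `rebalance` CLI; the topology name `enrichment` and the component name `kafkaSpout` are placeholders for whatever names your deployment actually uses:
+```
+# Raise the kafkaSpout component of the enrichment topology to 48 executors,
+# waiting 10 seconds for in-flight tuples to drain before the rebalance occurs.
+storm rebalance enrichment -w 10 -e kafkaSpout=48
+```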
+
+We found that the default values for poll.timeout.ms, offset.commit.period.ms, and max.uncommitted.offsets worked well in nearly all cases.
+As a general rule, it was optimal to set spout parallelism equal to the number of partitions used in your Kafka topic. Any greater
+parallelism will leave you with idle consumers since Kafka limits the max number of consumers to the number of partitions. This is
+important because Kafka has certain ordering guarantees for message delivery per partition that would not be possible if more than
+one consumer in a given consumer group were able to read from that partition.
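+
+To apply this rule you first need the partition count for the topic in question. One way to check it is with the standard Kafka topics CLI, sketched below; the `$KAFKA_HOME` and `$ZOOKEEPER` variables are the ones defined in the Tooling section that follows:
+```
+# Show the partition count, replication factor, and leader assignments for the enrichments topic
+${KAFKA_HOME}/bin/kafka-topics.sh --zookeeper $ZOOKEEPER --topic enrichments --describe
+```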
+
+## Tooling
+
+Before we get to the actual tooling used to monitor performance, it helps to describe what we might actually want to monitor and potential
+pain points. Prior to switching over to the new Storm Kafka client, which leverages the new Kafka consumer API under the hood, offsets
+were stored in Zookeeper. While the broker hosts are still stored in Zookeeper, this is no longer true for the offsets, which are now
+stored in Kafka itself. This is a configurable option, and you may switch back to Zookeeper if you choose, but Metron is currently using
+the new defaults. This is useful to know as you're investigating both correctness and throughput performance.
+
+First, we need to set up some environment variables:
+```
+export BROKERLIST=<your broker comma-delimited list of host:ports>
+export ZOOKEEPER=<your zookeeper comma-delimited list of host:ports>
+export KAFKA_HOME=<kafka home dir>
+export METRON_HOME=<your metron home>
+export HDP_HOME=<your HDP home>
+```
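+
+As noted above, the new consumer stores its offsets in Kafka itself rather than Zookeeper. A quick sanity check (just a sketch, assuming the default offset storage settings) is to confirm that the internal `__consumer_offsets` topic exists:
+```
+# The presence of the internal __consumer_offsets topic indicates that consumer
+# offsets are being stored in Kafka rather than Zookeeper.
+${KAFKA_HOME}/bin/kafka-topics.sh --zookeeper $ZOOKEEPER --list | grep consumer_offsets
+```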
+
+If you have Kerberos enabled, set up the security protocol:
+```
+$ cat /tmp/consumergroup.config
+security.protocol=SASL_PLAINTEXT
+```
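+
+Depending on how your cluster is kerberized, the Kafka CLI tools may also need a valid ticket and a client JAAS configuration before that property takes effect. The following is only a hedged sketch; the keytab path, principal, and JAAS file location are examples and will differ in your environment:
+```
+# Obtain a Kerberos ticket for a principal that is allowed to read the consumer group
+kinit -kt /etc/security/keytabs/metron.headless.keytab metron@EXAMPLE.COM
+
+# Point the Kafka CLI tools at a client JAAS config that uses the ticket cache
+export KAFKA_OPTS="-Djava.security.auth.login.config=/usr/hdp/current/kafka-broker/config/kafka_client_jaas.conf"
+```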
+
+Now run the following command for a running topology's consumer group. In this example we are using enrichments.
+```
+${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
+ --command-config=/tmp/consumergroup.config \
+ --describe \
+ --group enrichments \
+ --bootstrap-server $BROKERLIST \
+ --new-consumer
+```
+
+This will return a table with the following output depicting offsets for all partitions and consumers associated with the specified
+consumer group:
+```
+GROUP          TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  OWNER
+enrichments    enrichments    9          29746066        29746067        1    consumer-2_/xxx.xxx.xxx.xxx
+enrichments    enrichments    3          29754325        29754326        1    consumer-1_/xxx.xxx.xxx.xxx
+enrichments    enrichments    43         29754331        29754332        1    consumer-6_/xxx.xxx.xxx.xxx
+...
+```
+
+_Note_: You won't see any output until a topology is actually running because the consumer groups only exist while consumers in the
+spouts are up and running.
+
+The primary column to watch is the LAG column, which is the current delta between the current offset and the end offset for the
+partition. This tells us how close we are to keeping up with incoming data and, as we found through multiple trials, whether there
+are any problems with specific consumers getting stuck.
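+
+If you would rather track a single number over time, you can sum the per-partition lag. This is only a rough sketch and assumes the LAG value is the sixth whitespace-delimited column, as in the output format shown above; adjust the column index if your Kafka version prints a different layout:
+```
+${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
+    --command-config=/tmp/consumergroup.config \
+    --describe \
+    --group enrichments \
+    --bootstrap-server $BROKERLIST \
+    --new-consumer 2>/dev/null \
+    | awk '$6 ~ /^[0-9]+$/ { total += $6 } END { print "total lag:", total }'
+```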
+
+Taking this one step further, it's probably more useful if we can watch the offsets and lags change over time. In order to do this
+we'll add a "watch" command and set the refresh rate to 10 seconds.
+
+```
+watch -n 10 -d ${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
+ --command-config=/tmp/consumergroup.config \
+ --describe \
+ --group enrichments \
+ --bootstrap-server $BROKERLIST \
+ --new-consumer
+```
+
+Every 10 seconds the command will re-run and the screen will be refreshed with new information. The most useful bit is that the
+watch command will highlight the differences between the current output and the previous output.
+
+We can also monitor our Storm topologies by using the Storm UI - see http://www.malinga.me/reading-and-understanding-the-storm-ui-storm-ui-explained/
+
+And lastly, you can leverage some GUI tooling to make creating and modifying your Kafka topics a bit easier -
+see https://github.com/yahoo/kafka-manager
+
+## General Knobs and Levers
--- End diff --
Please spell out "number of", rather than using '#' as an abbreviation,
since the sharp sign conflicts with markdown syntax. Better yet, does it make
sense to actually give the config parameter name? Or a statement of where and
what one changes in the Ambari UI? Thanks.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---