Github user simonellistonball commented on a diff in the pull request:

    https://github.com/apache/metron/pull/614#discussion_r120961066
  
    --- Diff: metron-platform/Performance-tuning-guide.md ---
    @@ -0,0 +1,326 @@
    +# Metron Performance Tuning Guide
    +
    +## Overview
    +
    +This document provides guidance from our experiences tuning the Apache Metron Storm topologies for maximum performance. You'll find
    +suggestions for optimum configurations under a 1 Gbps load along with some guidance around the tooling we used to monitor and assess
    +our throughput.
    +
    +In the simplest terms, Metron is a streaming architecture created on top of Kafka and three main types of Storm topologies: parsers,
    +enrichment, and indexing. Each parser has its own topology and there is also a highly performant, specialized spout-only topology
    +for streaming PCAP data to HDFS. We found that the architecture can be tuned almost exclusively by adjusting a few primary Storm and
    +Kafka parameters along with a few Metron-specific options. You can think of the data flow as being similar to water flowing through a
    +pipe, and the majority of these options assist in tweaking the various pipe widths in the system.
    +
    +## General Suggestions
    +
    +Note that there is currently no method for specifying the number of tasks independently of the number of executors in Flux topologies (enrichment,
    + indexing). By default, the number of tasks will equal the number of executors. Logically, setting the number of tasks equal to the number
    +of executors is sensible. Storm enforces # executors <= # tasks. The reason you might set the number of tasks higher than the number of
    +executors is for future performance tuning and rebalancing without the need to bring down your topologies. The number of tasks is fixed
    +at topology startup time whereas the number of executors can be increased up to a maximum value equal to the number of tasks.
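    +
    +For example, if a topology was launched with more tasks than executors, the executor count can later be scaled up (bounded by the fixed task count) with Storm's `rebalance` command rather than redeploying. This is only an illustrative sketch; the topology name, component name, and counts below are placeholders, not recommendations:
    +```
    +# Hypothetical example: after a 10 second wait, move the enrichment topology to 8 workers
    +# and give the placeholder component "enrichmentJoinBolt" 24 executors.
    +# Executors can only grow up to the number of tasks fixed at startup.
    +storm rebalance enrichment -w 10 -n 8 -e enrichmentJoinBolt=24
    +```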
    +
    +We found that the default values for `poll.timeout.ms`, `offset.commit.period.ms`, and `max.uncommitted.offsets` worked well in nearly all cases.
    +As a general rule, it was optimal to set spout parallelism equal to the number of partitions used in your Kafka topic. Any greater
    +parallelism will leave you with idle consumers since Kafka limits the max number of consumers to the number of partitions. This is
    +important because Kafka has certain ordering guarantees for message delivery per partition that would not be possible if more than
    +one consumer in a given consumer group were able to read from that partition.
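    +
    +To check how many partitions a topic currently has (and therefore what spout parallelism to target), you can describe the topic. This is a sketch that assumes the KAFKA_HOME and ZOOKEEPER environment variables defined in the Tooling section below, and uses the enrichments topic purely as an example:
    +```
    +# show partition count, replication, and leaders for the topic
    +${KAFKA_HOME}/bin/kafka-topics.sh --zookeeper $ZOOKEEPER --describe --topic enrichments
    +```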
    +
    +## Tooling
    +
    +Before we get to the actual tooling used to monitor performance, it helps to describe what we might actually want to monitor and potential
    +pain points. Prior to switching over to the new Storm Kafka client, which leverages the new Kafka consumer API under the hood, offsets
    +were stored in Zookeeper. While the broker hosts are still stored in Zookeeper, this is no longer true for the offsets, which are now
    +stored in Kafka itself. This is a configurable option, and you may switch back to Zookeeper if you choose, but Metron is currently using
    +the new defaults. This is useful to know as you're investigating both correctness and throughput performance.
    +
    +First we need to set up some environment variables:
    +```
    +export BROKERLIST=<your comma-delimited list of broker host:ports>
    +export ZOOKEEPER=<your comma-delimited list of zookeeper host:ports>
    +export KAFKA_HOME=<kafka home dir>
    +export METRON_HOME=<your metron home>
    +export HDP_HOME=<your HDP home>
    +```
    +
    +If you have Kerberos enabled, set up the security protocol
    +```
    +$ cat /tmp/consumergroup.config
    +security.protocol=SASL_PLAINTEXT
    +```
    +
    +Now run the following command for a running topology's consumer group. In this example we are using enrichments.
    +```
    +${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```
    +
    +This will return a table with the following output depicting offsets for all partitions and consumers associated with the specified
    +consumer group:
    +```
    +GROUP                          TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
    +enrichments                    enrichments        9          29746066        29746067        1               consumer-2_/xxx.xxx.xxx.xxx
    +enrichments                    enrichments        3          29754325        29754326        1               consumer-1_/xxx.xxx.xxx.xxx
    +enrichments                    enrichments        43         29754331        29754332        1               consumer-6_/xxx.xxx.xxx.xxx
    +...
    +```
    +
    +_Note_: You won't see any output until a topology is actually running because the consumer groups only exist while consumers in the
    +spouts are up and running.
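    +
    +To see which consumer groups currently exist, you can list them. This sketch reuses the same command-config file and broker list as above:
    +```
    +# list all consumer groups currently known to the brokers
    +${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --list \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```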
    +
    +The primary column to pay attention to is the LAG column, which is the current delta between the
    +current and end offset for the partition. This tells us how close we are to keeping up with incoming data and, as we found through
    +multiple trials, whether there are any problems with specific consumers getting stuck.
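    +
    +If you want a single aggregate lag figure rather than per-partition rows, you can sum the LAG column. This assumes LAG is the sixth whitespace-delimited column of the output above, which may differ across Kafka versions:
    +```
    +# sum the LAG column (column 6 here) across all partitions, skipping the header row
    +${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer | awk 'NR > 1 { sum += $6 } END { print "total lag:", sum }'
    +```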
    +
    +Taking this one step further, it's probably more useful if we can watch the offsets and lags change over time. In order to do this
    +we'll add a "watch" command and set the refresh rate to 10 seconds.
    +
    +```
    +watch -n 10 -d ${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +    --command-config=/tmp/consumergroup.config \
    +    --describe \
    +    --group enrichments \
    +    --bootstrap-server $BROKERLIST \
    +    --new-consumer
    +```
    +
    +Every 10 seconds the command will re-run and the screen will be refreshed with new information. The most useful bit is that the
    +watch command will highlight the differences between the current output and the previous output.
    +
    +We can also monitor our Storm topologies by using the Storm UI - see http://www.malinga.me/reading-and-understanding-the-storm-ui-storm-ui-explained/
    +
    +And lastly, you can leverage some GUI tooling to make creating and modifying your Kafka topics a bit easier -
    +see https://github.com/yahoo/kafka-manager
    +
    +## General Knobs and Levers
    +
    +- Kafka
    +    - # partitions
    +- Storm
    +    - Kafka spout - polling frequency and timeouts
    +    - # workers
    +    - ackers
    +    - max spout pending
    +    - spout parallelism
    +    - bolt parallelism
    +    - # executors
    +- Metron
    +    - bolt cache size - handles how many messages can be cached. This cache is used while waiting for all parts of the message to be rejoined.
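    +
    +Several of the Storm knobs above map directly onto settings that can be placed in a topology config file like the ones shown in the Parsers section below. This is only an illustrative sketch; the file name and values are placeholders, not recommendations:
    +```
    +$ cat ~/.storm/storm-example.config
    +{
    +    "topology.workers" : 4,
    +    "topology.acker.executors" : 4,
    +    "topology.max.spout.pending" : 2000
    +}
    +```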
    +
    +## Topologies
    +
    +### Parsers
    +
    +The parsers and PCAP use a builder utility, as opposed to enrichments and indexing, which use Flux.
    +
    +We set the number of partitions for our inbound Kafka topics to 48.
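    +
    +For reference, the partition count is set when a topic is created and can be increased later on an existing topic. A sketch, assuming the KAFKA_HOME and ZOOKEEPER variables from the Tooling section and using "bro" purely as an example topic name:
    +```
    +# create a sensor topic with 48 partitions
    +${KAFKA_HOME}/bin/kafka-topics.sh --zookeeper $ZOOKEEPER --create --topic bro --partitions 48 --replication-factor 1
    +# or raise the partition count on an existing topic
    +${KAFKA_HOME}/bin/kafka-topics.sh --zookeeper $ZOOKEEPER --alter --topic bro --partitions 48
    +```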
    +
    +```
    +$ cat ~metron/.storm/storm-bro.config
    +
    +{
    +    ...
    +    "topology.max.spout.pending" : 2000
    +    ...
    +}
    +```
    +
    +These are the recommended spout defaults from Storm and are currently the defaults provided in the Kafka spout itself.
    +In fact, if you find the recommended defaults work fine for you, then this file might not be necessary at all.
    +```
    +$ cat ~/.storm/spout-bro.config
    +{
    +    ...
    +    "spout.pollTimeoutMs" : 200,
    +    "spout.maxUncommittedOffsets" : 10000000,
    +    "spout.offsetCommitPeriodMs" : 30000
    +}
    +```
    +
    +We ran our bro parser topology with the following options
    --- End diff --
    
    Is there any explanation of how we got to these numbers? These will be heavily dependent on hardware, particularly CPU and disk configuration, so the
    context could be very important.

