Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/storm/pull/2241#discussion_r158009771 --- Diff: docs/Performance.md --- @@ -0,0 +1,128 @@ +--- +title: Performance Tuning +layout: documentation +documentation: true +--- + +Latency, throughput and CPU consumption are the three key dimensions involved in performance tuning. +In the following sections we discuss the settings that can used to tune along these dimension and understand the trade-offs. + +It is important to understand that these settings can vary depending on the topology, the type of hardware and the number of hosts used by the topology. + +## 1. Batch Size +Spouts and Bolts communicate with each other via concurrent message queues. The batch size determines the number of messages to be buffered before +the producer (spout/bolt) attempts to actually write to the downstream component's message queue. Inserting messages in batches to downstream +queues helps reduce the number of synchronization operations required for the inserts. Consequently this helps achieve higher throughput. However, +sometimes it may take a little time for the buffer to fill up, before it is flushed into the downstream queue. This implies that the buffered messages +will take longer to become visible to the downstream consumer who is waiting to process them. This can increase the average end-to-end latency for +these messages. The latency can get very bad if the batch sizes are large and the topology is not experiencing high traffic. + +`topology.producer.batch.size` : The batch size for writes into the receive queue of any spout/bolt is controlled via this setting. This setting +impacts the communication within a worker process. Each upstream producer maintains a separate batch to a component's receive queue. So if two spout +instances are writing to the same downstream bolt instance, each of the spout instances will have maintain a separate batch. + +`topology.transfer.batch.size` : Messages that are destined to a spout/bolt running on a different worker process, are sent to a queue called +the **Worker Transfer Queue**. The Worker Transfer Thread is responsible for draining the messages in this queue and send them to the appropriate +worker process over the network. This setting controls the batch size for writes into the Worker Transfer Queue. This impacts the communication +between worker processes. + +#### Guidance + +**For Low latency:** Set batch size to 1. This basically disables batching. This is likely to reduce peak sustainable throughput under heavy traffic, but +not likely to impact throughput much under low/medium traffic situations. +**For High throughput:** Set batch size > 1. Try values like 10, 100, 1000 or even higher and see what yields the best throughput for the topology. +Beyond a certain point the throughput is likely to get worse. +**Varying throughput:** Topologies often experience fluctuating amounts of incoming traffic over the day. Other topos may experience higher traffic in some +paths and lower throughput in other paths simultaneously. If latency is not a concern, a small bach size (e.g. 10) and in conjunction with the right flush +frequency may provide a reasonable compromise for such scenarios. For meeting stricter latency SLAs, consider setting it to 1. + + +## 2. Flush Tuple Frequency +In low/medium traffic situations or when batch size is too large, the batches may take too long to fill up and consequently the messages could take unacceptably +long time to become visible to downstream components. In such case, periodic flushing of batches is necessary to keep the messages moving and avoid compromising +latencies when batching is enabled. + +When batching has been enabled, special messages called *flush tuples* are inserted periodically into the receive queues of all spout and bolt instances. +This causes each spout/bolt instance to flush all its outstanding batches to their respective downstream components. + +`topology.flush.tuple.freq.millis` : This setting controls how often the flush tuples are generated. Flush tuples are not generated if this configuration is +set to 0 or if (`topology.producer.batch.size`=1 and `topology.transfer.batch.size`=1). + + +#### Guidance +Flushing interval can be used as tool to retain the higher throughput benefits of batching and avoid batched messages getting stuck for too long waiting for their. +batch to fill. Preferably this value should be larger than the average execute latencies of the bolts in the topology. Trying to flush the queues more frequently than +the amount of time it takes to produce the messages may hurt performance. Understanding the average execute latencies of each bolt will help determine the average +number of messages in the queues between two flushes. + +**For Low latency:** A smaller value helps achieve tighter latency SLAs. +**For High throughput:** When trying to maximize throughput under high traffic situations, the batches are likely to get filled and flushed automatically. +To optimize for such cases, this value can be set to a higher number. +**Varying throughput:** If latency is not a concern, a larger value will optimize for high traffic situations. For meeting tighter SLAs set this to lower +values. + + +## 3. Wait Strategy +Wait strategies are used to conserve CPU usage by trading off some latency and throughput. They are applied for the following situations: + +3.1 **Spout Wait:** In low/no traffic situations, Spout's nextTuple() may not produce any new emits. To prevent invoking the Spout's nextTuple, +this wait strategy is used between nextTuple() calls to allow the spout's executor thread to idle and conserve CPU. Select a strategy using `topology.spout.wait.strategy`. + +3.2 **Bolt Wait:** : When a bolt polls it's receive queue for new messages to process, it is possible that the queue is empty. This typically happens +in case of low/no traffic situations or when the upstream spout/bolt is inherently slower. This wait strategy is used in such cases. It avoids high CPU usage +due to the bolt continuously checking on a typically empty queue. Select a strategy using `topology.bolt.wait.strategy`. The chosen strategy can be further configured +using the `topology.bolt.wait.*` settings. + +3.3 **Backpressure Wait** : Select a strategy using `topology.backpressure.wait.strategy`. When a spout/bolt tries to write to a downstream component's receive queue, +there is a possibility that the queue is full. In such cases the write needs to be retried. This wait strategy is used to induce some idling in-between re-attempts for +conserving CPU. The chosen strategy can be further configured using the `topology.backpressure.wait.*` settings. + + +#### Built-in wait strategies: + +- **SleepSpoutWaitStrategy** : This is the only built-in strategy available for Spout Wait. It cannot be applied to other Wait situations. It is a simple static strategy that +calls Thread.sleep() each time. Set `topology.spout.wait.strategy` to `org.apache.storm.spout.SleepSpoutWaitStrategy` for using this. `topology.sleep.spout.wait.strategy.time.ms` +configures the sleep time. + +- **ProgressiveWaitStrategy** : This strategy can be used for Bolt Wait or Backpressure Wait situations. Set the strategy to 'org.apache.storm.policy.WaitStrategyProgressive' to +select this wait strategy. This is a dynamic wait strategy that enters into progressively deeper states of CPU conservation if the Backpressure Wait or Bolt Wait situations persist. +It has 3 levels of idling and allows configuring how long to stay at each level : + + 1. No Waiting - The first few times it will return immediately. This does not conserve any CPU. The number of times it remains in this state is configured using + `topology.bolt.wait.progressive.level1.count` or `topology.backpressure.wait.progressive.level1.count` depending which situation it is being used. + + 2. Park Nanos - In this state it disables the current thread for thread scheduling purposes, for 1 nano second using LockSupport.parkNanos(). This puts the CPU in a minimal + conservation state. It remains in this state for `topology.bolt.wait.progressive.level2.count` or `topology.backpressure.wait.progressive.level2.count` iterations. + + 3. Thread.sleep() - In this state it calls Thread.sleep() with the value specified in `topology.backpressure.wait.progressive.level3.sleep.millis` or in + `topology.bolt.wait.progressive.level3.sleep.millis` based on the Wait situation it is used in. This is the most CPU conserving level it remains in this level for + the remaining iterations. + + +- **ParkWaitStrategy** : This strategy can be used for Bolt Wait or Backpressure Wait situations. Set the strategy to `org.apache.storm.policy.WaitStrategyPark` to use this. +This strategy disables the current thread for thread scheduling purposes by calling LockSupport.parkNanos(). The amount of park time is configured using either +`topology.bolt.wait.park.microsec` or `topology.backpressure.wait.park.microsec` based on the wait situation it is used. Setting the park time to 0, effectively disables +invocation of LockSupport.parkNanos and this mode can be used to achieve busy polling (which at the cost of high CPU utilization even when idle, may improve latency and/or throughput). + + +## Max.spout.pending +The back pressure mechanism no longer requires `topology.max.spout.pending`. It is recommend to set this to null (default). --- End diff -- According to the report posted in issue, setting this option still shows better result and hence the tests in the report utilizes the option. Do we want to guide it as well, or still be better to recommend users to set this to null?
---