GitHub user knusbaum commented on a diff in the pull request:

https://github.com/apache/storm/pull/2289#discussion_r135074782

--- Diff: examples/storm-loadgen/README.md ---
@@ -0,0 +1,195 @@
+# Storm Load Generation Tools
+
+A set of tools to place an artificial load on a storm cluster to compare against a different storm cluster. This is particularly helpful when making changes to the data path in storm to see what impact, if any, the changes had. This is also useful for end users that want to compare different hardware setups to see what the trade-offs are, although actually running your real topologies is going to be more accurate.
+
+## Methodology
+The idea behind all of these tools is to measure the trade-offs between latency, throughput, and cost when processing data using Apache Storm.
+
+When processing data you typically will know a few things. First you will know about how much data you are going to be processing. This will typically be a range of values that changes throughout the day. You also will have an idea of how quickly you need the data processed. Often this is measured in terms of the latency it takes to process data at some percentile or set of percentiles. This is because in most use cases the value of the data declines over time, and being able to react to the data quickly is more valuable. You probably also have a budget for how much you are willing to spend to be able to process this data. There are always trade-offs in how quickly you can process some data and how efficiently you can process that data, both in terms of resource usage (cost) and latency. These tools are designed to help you explore that space.
+
+A note on how latency is measured. Storm typically measures latency from when a message is emitted by a spout until the point it is fully acked or failed (in many versions of storm it actually does this in the acker instead of the spout, so it is trying to be a measure of how long it takes for the actual processing, removing as much of the acker overhead as possible). For these tools we do it differently. We simulate a throughput and measure the start time of the tuple from when it would have been emitted if the topology could keep up with the load. In the normal case this should not be an issue, but if the topology cannot keep up with the throughput you will see the latency grow very high compared to the latency reported by storm.
+
+## Tools
+### CaptureLoad
+
+`CaptureLoad` will look at the topologies on a running cluster and store the structure of and metrics about each of these topologies in a format that can be used later to reproduce a similar load on the cluster.
+
+#### Usage
+```
+storm jar storm-loadgen.jar org.apache.storm.loadgen.CaptureLoad [options] [topologyName]*
+```
+|Option| Description|
+|-----|-----|
+|-a,--anonymize | Strip out any possibly identifiable information|
+| -h,--help | Print a help message |
+| -o,--output-dir <file> | Where to write (defaults to ./loadgen/)|
+
+#### Limitations
+This is still a work in progress. It does not currently capture CPU or memory usage of a topology. Resource requests (used by RAS when scheduling) within the topology are also not captured yet, nor is the user that actually ran the topology.
+
+### GenLoad
+
+`GenLoad` will take the files produced by `CaptureLoad` and replay them in a simulated way on a cluster. It also offers lots of ways to capture metrics about those simulated topologies to be able to compare different software versions or different hardware setups. You can also make adjustments to the topology before submitting it to change the size or throughput of the topology.
+
+### Usage
+```
+storm jar storm-loadgen.jar org.apache.storm.loadgen.GenLoad [options] [capture_file]*
+```
+
+|Option| Description|
+|-----|-----|
+| --debug | Print debug information about the adjusted topology before submitting it. |
+|-h,--help | Print a help message |
+| --local-or-shuffle | Replace shuffle grouping with local or shuffle grouping. |
+| --parallel <MULTIPLIER(:TOPO:COMP)?> | How much to scale the topology up or down in parallelism. The new parallelism will round up to the next whole number. If a topology + component is supplied only that component will be scaled. If topo or component is blank or a `'*'` all topologies or components that match the other part will be scaled. Only 1 scaling rule, the most specific, will be applied to a component. Providing a topology name is considered more specific than not providing one. (defaults to 1.0, no scaling) |
+| -r,--report-interval <INTERVAL_SECS> | How long in between reported metrics. Will be rounded up to the next 10 sec boundary. default 30 |
+| --reporter <TYPE:FILE?OPTIONS> | Provide the config for a reporter to run. See below for more information about these |
+| -t,--test-time <MINS> | How long to run the tests for in mins (defaults to 5) |
+| --throughput <MULTIPLIER(:TOPO:COMP)?> | How much to scale the topology up or down in throughput. If a topology + component is supplied only that component will be scaled. If topo or component is blank or a `'*'` all topologies or components that match will be scaled. Only 1 scaling rule, the most specific, will be applied to a component. Providing a topology name is considered more specific than not providing one. (defaults to 1.0, no scaling) |
+| -w,--report-window <INTERVAL_SECS> | How long of a rolling window should be in each report. Will be rounded up to the next report interval boundary. default 30|
+
+## ThroughputVsLatency
+This is a topology similar to `GenLoad` in most ways, except instead of simulating a load it runs a word count algorithm.
--- End diff --

In what ways is it similar? This sentence implies that it does not simulate load, which seems to be the main purpose of GenLoad.
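
For context on the latency note in the quoted README, here is a minimal sketch of the difference between measuring from the scheduled emit time and measuring from the actual emit time. This is not code from storm-loadgen: it assumes a single simulated worker that handles one tuple at a time, and `simulate`, `interval_ms`, and `service_ms` are hypothetical names chosen for illustration.

```python
def simulate(interval_ms, service_ms, count):
    """Toy model, not storm-loadgen code.

    Tuples are scheduled every `interval_ms`; the simulated worker needs
    `service_ms` per tuple and cannot start a tuple before it is free.
    Returns a list of (scheduled_latency, emit_latency) pairs in ms.
    """
    free_at = 0  # time at which the worker finishes its previous tuple
    latencies = []
    for i in range(count):
        scheduled = i * interval_ms        # when the tuple should have been emitted
        emitted = max(scheduled, free_at)  # a busy worker delays the real emit
        done = emitted + service_ms
        free_at = done
        latencies.append((done - scheduled, done - emitted))
    return latencies


if __name__ == "__main__":
    # Worker keeps up: both measures agree.
    print(simulate(100, 50, 5)[-1])    # (50, 50)
    # Worker is too slow: scheduled-time latency keeps growing.
    print(simulate(100, 200, 10)[-1])  # (1100, 200)
```

When the worker keeps up, the two measures match. When it falls behind, the latency measured from the scheduled time grows with every tuple while the latency measured from the actual emit stays flat, which is the effect the README text describes.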