I didn't fill in *any* of my footnotes...

[1] https://github.com/mrvisser/puppet-oae-example/tree/oae-loadtesting-config
[2] http://munin-monitoring.org/
[3] http://yourkit.com/features/index.jsp
[4] https://github.com/mrvisser/OAE-model-loader/tree/performance-testing
[5] https://oae-community.sakaiproject.org/content#p=lb5CsAfmg/Performance.pdf
On Tue, Aug 7, 2012 at 9:15 PM, Branden Visser <mrvis...@gmail.com> wrote:
> Hi everyone,
>
> I've been a busy bee working on getting a performance testing cluster ready for regular performance tests. I'm happy to say that there is a successful-looking load test running right now, though there is more work to do on the tests themselves and the automation.
>
> Here is a summary of next steps that are on my radar:
>
> * Collection and organization of all the information available (as noted under PROFILING / MONITORING), dumping it into a date/time-named directory on the load-driver machine for web access
> * Orchestration of cron jobs (or just a synchronous script) to re-execute tests on a regular basis
> * Putting together an automated Perf4J-enabled Jenkins build that can be used as the performance-testing platform for API execution / timing data
> * Ongoing tuning of the performance test scripts
> * Incrementally adding data to the performance testing environment as needed using the model loader
> * Looking at "Tsung Plotter" to generate comparison graphs of metrics over multiple runs
>
> Here is the Tsung data being generated right now: http://oae-loader.sakaiproject.org/20120807-2224/report.html
> Here are some pointers on how to read it: http://tsung.erlang-projects.org/user_manual.html#htoc70
>
> I've provided a detailed update and "practice documentation" below on the different facets that pull the testing together. As always, questions and suggestions are welcome.
>
> ENVIRONMENT
>
> Special thanks to Kyle and Erik with rSmart for all their help getting an Amazon cluster spinning for us. Our cluster is running like a champ. Configuration and server deployment are fully managed with puppet [1], and puppet rocks. This is what the cluster looks like:
>
> 2x App server node (one is sleeping for now)
> 1x Postgres node
> 1x Solr node
> 1x Preview processor node
> 1x Apache node
> ------
> 1x Load-test driver node
>
> The load-test driver node is the client machine that actually runs the load tests, so it is not a server component.
>
> PROFILING / MONITORING
>
> Operating system resources are being monitored and recorded using Munin [2]. Munin is a popular open-source tool that records more OS metrics (plus auxiliary things like Apache processes) than I knew existed. More importantly, Tsung integrates with it quite well to pull synchronized OS metrics (CPU, load, memory) from *all* cluster nodes during the performance test, so those metrics become an artifact of the Tsung load-testing reports.
>
> YourKit [3] has been worked into the puppet scripts so that it is painless to enable for a more in-depth look at the app server (see the sketch below). If enabled, YourKit runs as a Java agent on the app server and dumps a snapshot when the JVM is shut down -- presumably after the performance run. This gives us point-in-time analysis of CPU and thread telemetry throughout the performance test. We can enable object allocation analysis as well, if needed. YourKit is not enabled by default because it may introduce performance degradation during the tests.
>
> At the moment, OS resources are the only server profiling metrics being packaged with every performance test.
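As a rough illustration of the YourKit piece above: the agent path, snapshot directory, variable names and agent options below are assumptions for the sake of the sketch, not the actual puppet-managed configuration. The idea is simply to append an -agentpath entry to the app server's JVM options when profiling is enabled:

    # Sketch only: paths, variable names and agent options are placeholders.
    ENABLE_YOURKIT=true
    YOURKIT_AGENT=/opt/yourkit/bin/linux-x86-64/libyjpagent.so

    if [ "$ENABLE_YOURKIT" = "true" ]; then
      # Start CPU sampling and dump a snapshot into the given directory
      # when the JVM exits, i.e. after the performance run.
      JAVA_OPTS="$JAVA_OPTS -agentpath:${YOURKIT_AGENT}=sampling,onexit=snapshot,dir=/var/log/oae/yourkit"
    fi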
> Immediate plans are to automate the packaging of the following, which are already being recorded regularly on the servers:
>
> * Verbose GC logs from the app server node and Solr
> * The full app server log
> * The Postgres slow-query log
> * A snapshot of the Solr admin console to see Solr cache usage
> * A snapshot of the /system/telemetry page
> * Perf4J logs. These will provide (unsynchronized) time-series API execution timings as well as additional telemetry throughout the test.
> * Maybe a full Munin cluster snapshot, though it is probably not granular enough to be useful.
>
> LOAD TESTING RESULTS
>
> Getting the client-side load-testing results with Tsung is a no-brainer. It generates the static HTML pages that I'm currently hosting here: http://oae-loader.sakaiproject.org/ -- I will be restructuring these directories as I automate the collection of more information (i.e., logs). But if you go into any directory and click "report.html", that will get you started.
>
> The most valuable information I've found so far in the Tsung reports:
>
> Transactions. If we hold to a convention of separating each user-facing action (e.g., "View Contacts") into a uniquely named transaction when writing tests, this gives a good view into how well those individual actions perform. We will need to edit a couple of test cases to be consistent on the transaction id so we get more accurate data -- for example, the highest 10-second mean of "tr_login" is over a minute while the smallest is less than a second, because there are multiple transactions keyed with "login" and some perform many more requests than the others. Also, the "Count" column gives a good indication of action distribution across user sessions and can help guide balancing the tests properly.
>
> Load. Simply the load on each cluster member throughout the load test.
>
> HTTP Status Code. The count of each HTTP status code, as a time series across the duration of the test.
>
> DATA LOADING
>
> For regular performance tests, snapshots have been taken of previously loaded data that can be dumped back in, which is much faster than using the model builder. This has already been automated and can be run nightly in preparation for load tests.
>
> Data is loaded into the server using the OAE-model-loader [4]. I've recently submitted a PR which allows us to incrementally load data batches on top of one another to step up the content when needed. The test currently running uses 500 users and 250 worlds. I can load in another batch of 500/250 and take a snapshot. Then another. Additionally, the data that feeds the load tests (i.e., the CSV files) is easily generated from the OAE-model-loader source scripts, and that can be done incrementally as well, so we can performance-test each increment. Here is an example workflow (see the wrapper sketch below):
>
> 1. Use generate.js to generate 10 batches of data, 500 users and 900 worlds per batch.
> --
> 2. Load 1 batch of data into OAE: node loaddata.js -b 1
> 3. Assemble the package of load-testing CSV data that goes along with that data: node performance-testing/package.js -b 1 -s .
> 4. Run a performance test with the CSV data
> --
> 5. Load a second batch of data into OAE: node loaddata.js -s 1 -b 2
> 6. Assemble the package of load-testing CSV data that goes along with what is currently in the OAE environment: node performance-testing/package.js -b 2 -s .
> 7. Run a performance test with the new CSV data
>
> You get the picture.
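To make that workflow concrete, here is a rough shell wrapper around the steps above. The directory layout and the tsung.xml path are assumptions, the 10 batches are assumed to have been generated with generate.js already, and generalizing the -s/-b pattern beyond batch 2 is a guess; only the loaddata.js and package.js invocations for batches 1 and 2 come directly from the steps quoted above.

    #!/bin/bash
    # Sketch of the incremental load/package/test loop described above.
    # Assumes this runs from the OAE-model-loader checkout and that
    # /etc/tsung/tsung.xml points at the generated CSV files.
    set -e

    BATCHES=10

    for b in $(seq 1 "$BATCHES"); do
      if [ "$b" -eq 1 ]; then
        # Step 2 above: load the first batch into OAE.
        node loaddata.js -b 1
      else
        # Step 5 above, generalized: skip the batches already loaded and
        # load the next one (the -s/-b pattern beyond batch 2 is a guess).
        node loaddata.js -s "$((b - 1))" -b "$b"
      fi

      # Steps 3/6: assemble the CSV package matching what is now in OAE.
      node performance-testing/package.js -b "$b" -s .

      # Steps 4/7: run the Tsung test against the stepped-up data set.
      tsung -f /etc/tsung/tsung.xml start
    done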
>
> The plan is to incrementally add content until the data set is comparable to the reference environment [5]; loading it all at once was unstable.
>
> AUTOMATION
>
> This is where I'm at now that I'm actually spinning tests. What is currently automated:
>
> * Purging/restoring data and bouncing the servers for a new test. I imagine this can be a cron job that executes about 15 minutes before the Tsung cron job is kicked off.
> * Publication of Tsung results. This works only because I'm simply pointing Apache at the Tsung performance results directory.
>
> Next steps for automation:
>
> * Collection and organization of all the information available (as noted under PROFILING / MONITORING), dumping it into a date/time-named directory on the load-driver machine
> * Orchestration of cron jobs (or just a synchronous script) to re-execute tests on a regular basis (a rough sketch follows at the bottom of this message)
> * Putting together an automated Perf4J-enabled Jenkins build that can be used as the performance-testing platform for API execution / timing data
>
> Thanks for reading! :)
>
> --
> Cheers,
> Branden
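And for the automation piece, a very rough sketch of what the "synchronous script" option could look like. Every helper script name and path below is a placeholder (the real restore/bounce logic lives in puppet and cron); only the ordering -- restore data, bounce servers, wait roughly 15 minutes, run Tsung, then collect the results into a date/time-named directory -- comes from the notes above.

    #!/bin/bash
    # Placeholder names throughout; this only illustrates the ordering.
    set -e

    # Restore the previously snapshotted data and bounce the servers
    # (hypothetical helper scripts).
    /opt/oae-perf/restore-snapshot.sh
    /opt/oae-perf/bounce-servers.sh

    # Give the cluster roughly 15 minutes to settle, mirroring the gap
    # between the restore cron job and the Tsung cron job described above.
    sleep 900

    # Kick off the load test.
    tsung -f /etc/tsung/tsung.xml start

    # Copy the Tsung logs into a date/time-named directory served by
    # Apache on the load driver (destination path is an assumption).
    RUN_DIR="/var/www/oae-loader/$(date +%Y%m%d-%H%M)"
    mkdir -p "$RUN_DIR"
    cp -r ~/.tsung/log/* "$RUN_DIR/"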