Hi everyone, I've been a busy bee getting a performance-testing cluster ready to run regular performance tests. I'm happy to say that a successful-looking load test is running right now, though there is more work to do on the tests themselves and on the automation.
Here is a summary of next steps that are on my radar:

* Collect and organize all available information (as noted under PROFILING / MONITORING) and dump it to a date/time-named directory on the load-driver machine for web access
* Orchestrate cron jobs (or just put together a synchronous script) to re-execute tests on a regular basis
* Put together an automated Perf4J-enabled Jenkins build that can be used as the performance-testing platform for API execution / timing data
* Ongoing tuning of the performance test scripts
* Incrementally add data to the performance-testing environment as needed using the model loader
* Look at using "Tsung Plotter" to generate comparison graphs of metrics over multiple runs

Here is the Tsung data being generated right now: http://oae-loader.sakaiproject.org/20120807-2224/report.html

Here are some pointers on how to read it: http://tsung.erlang-projects.org/user_manual.html#htoc70

I've provided a detailed update and "practice documentation" below on the different facets that pull the testing together. As always, questions and suggestions are welcome.

ENVIRONMENT

Special thanks to Kyle and Erik with rSmart for all their help getting an Amazon cluster spinning for us. Our cluster is running like a champ. Configuration and server deployment are fully managed with puppet [1], and puppet rocks.

This is what the cluster looks like:

2x App server node (one is sleeping for now)
1x Postgres node
1x Solr node
1x Preview processor node
1x Apache node
------
1x Load-test driver node

The load-test driver node is the client machine that actually runs the load tests, so it is not a server component.

PROFILING / MONITORING

Operating system resources are being monitored and recorded using Munin [2]. Munin is a popular open-source tool that can record more OS metrics (plus auxiliary data like Apache processes) than I knew existed.
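For the cron-orchestration item, the simplest version is probably just two crontab entries on the load-driver node, with the data restore running ahead of the Tsung run. A sketch only: the script paths and times here are made up for illustration.

```shell
# Hypothetical crontab on the load-driver node. The restore job runs
# about 15 minutes before the Tsung job kicks off.
45 1 * * * /opt/perf/restore-and-bounce.sh   # purge/restore data, bounce servers
0  2 * * * /usr/bin/tsung -f /opt/perf/scenario.xml start
```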
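To make the first next-steps item concrete, the artifact collection could be a small script on the load-driver machine that bundles everything from one run into a date/time-named, web-served directory. This is just a sketch: PERF_BASE and all of the log paths below are hypothetical placeholders, not our actual layout.

```shell
#!/bin/sh
# Sketch: bundle per-run artifacts into a date/time-named directory
# that Apache can serve. PERF_BASE and the log paths are assumptions.
BASE="${PERF_BASE:-/tmp/perf-results}"
RUN_DIR="$BASE/$(date +%Y%m%d-%H%M)"
mkdir -p "$RUN_DIR"

# Copy whichever artifacts exist; missing files are skipped silently.
for f in /var/log/oae/app.log /var/log/oae/gc.log /var/log/postgresql/slow-query.log; do
  if [ -f "$f" ]; then cp "$f" "$RUN_DIR/"; fi
done

echo "$RUN_DIR"
```

The `date +%Y%m%d-%H%M` format matches the directory naming already used for the Tsung reports (e.g., 20120807-2224).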
More importantly, Tsung integrates with Munin quite well to pull synchronized OS metrics (CPU, load, memory) from *all* cluster nodes during the performance test. Those metrics then become an artifact of the Tsung load-testing reports.

YourKit [3] has been worked into the puppet scripts, so it is painless to enable it for a more in-depth look at the app server. If enabled, YourKit runs as a Java agent on the app server and dumps a snapshot when the JVM is shut down -- presumably after the performance run. This gives us point-in-time analysis of CPU and thread telemetry throughout the performance test. We can enable object-allocation analysis as well, if needed. YourKit is not enabled by default because it may introduce performance degradation during the tests.

At the moment, OS resources are the only server profiling metrics being packaged with every performance test. Immediate plans are to automate the packaging of the following, which are already being recorded regularly on the servers:

* Verbose GC logs of the app server node and Solr
* Full app server log
* Postgres slow-query log
* A snapshot of the Solr admin console, to see Solr cache usage
* A snapshot of the /system/telemetry page
* Perf4J logs. These will provide (unsynchronized) time-series API execution timings as well as additional telemetry throughout the test.
* Maybe a full Munin cluster snapshot, though it is probably not granular enough to be useful.

LOAD TESTING RESULTS

Getting the client-side load-testing results with Tsung is a no-brainer. It generates the static HTML pages that I'm currently hosting here: http://oae-loader.sakaiproject.org/ - I will be restructuring these directories as I automate the collection of more information (i.e., logs). But if you go into any directory and click "report.html", that will get you started.

The most valuable information I've found so far in the Tsung reports:

Transactions.
If we hold a convention of separating each user-facing action (e.g., "View Contacts") by a unique transaction name when writing tests, this provides a good view into how well those individual actions perform. We will need to edit a couple of test cases to be consistent with the "transaction id" so we can get more accurate data -- for example, the highest 10-second mean of "tr_login" is over a minute while the smallest is less than a second, because there are multiple transactions keyed with "login", some of which perform many more requests than the others. Also, the "Count" column gives a good indication of action distribution across user sessions and can help guide balancing the tests properly.

Load. Simply the load on each cluster member throughout the load test.

HTTP Status Code. The number of each HTTP status code, as a time series across the duration of the test.

DATA LOADING

For regular performance tests, snapshots have been taken of previously-loaded data that can be dumped back in, which is much faster than using the model builder. This has already been automated, and can be run nightly in preparation for load tests.

Data is loaded into the server using the OAE-model-loader [4]. I've recently submitted a PR which allows us to incrementally load data batches on top of one another to step up the content when needed. The test currently running uses 500 users and 250 worlds. I can load in another batch of 500/250 and take a snapshot. Then another. Additionally, the data that feeds the load tests (i.e., the CSV files) is easily generated from the OAE-model-loader source scripts, and that can be done incrementally as well, so we can performance test each increment.

Here is an example workflow:

1. Use generate.js to generate 10 batches of data, 500 users and 900 worlds per batch.
2. Load 1 batch of data into OAE: node loaddata.js -b 1
3.
Assemble the package of load-testing CSV data that goes along with that data: node performance-testing/package.js -b 1 -s .
4. Run the performance test with that CSV data
5. Load a second batch of data into OAE: node loaddata.js -s 1 -b 2
6. Assemble the package of load-testing CSV data that goes along with what is currently in the OAE environment: node performance-testing/package.js -b 2 -s .
7. Run the performance test with the new CSV data

You get the picture. The plan is to incrementally add content until the data set is comparable to the reference environment [5]; doing it all at once was unstable.

AUTOMATION

This is where I'm at now that I'm actually spinning tests.

What is currently automated:

* Purging/restoring data and bouncing the servers for a new test. I imagine this can be a cron job that executes about 15 minutes before the Tsung cron job is kicked off.
* Publication of Tsung results. This works only because I'm simply pointing Apache at the Tsung performance-results directory.

Next steps for automation:

* Collect and organize all available information (as noted under PROFILING / MONITORING) and dump it to a date/time-named directory on the load-driver machine.
* Orchestrate cron jobs (or just put together a synchronous script) to re-execute tests on a regular basis
* Put together an automated Perf4J-enabled Jenkins build that can be used as the performance-testing platform for API execution / timing data

Thanks for reading! :)

--
Cheers,
Branden
_______________________________________________
oae-dev mailing list
oae-dev@collab.sakaiproject.org
http://collab.sakaiproject.org/mailman/listinfo/oae-dev