Repository: systemml
Updated Branches:
  refs/heads/gh-pages 5c3c2f27d -> bf0245c69
[SYSTEMML-1451] phase 2 work

Completed these tasks as part of Phase 2 for Google Summer of Code '17
- Decouple systemml-spark-submit.py
- Decouple systemml-standalone.py
- Refactor perf test suite to accept args like debug, stats, config etc...
- Add HDFS support
- Google Docs support
- Compare SystemML with previous versions
- Pylint, Comment
- Extra arguments configuration Test
- Windows Test
- Doc update
- systemml standalone comments
- systemml spark submit comments

Closes #575

Project: http://git-wip-us.apache.org/repos/asf/systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/systemml/commit/83b9a221
Tree: http://git-wip-us.apache.org/repos/asf/systemml/tree/83b9a221
Diff: http://git-wip-us.apache.org/repos/asf/systemml/diff/83b9a221

Branch: refs/heads/gh-pages
Commit: 83b9a221e5bee8a0232d5a470702944aab056fb4
Parents: 5c3c2f2
Author: krishnakalyan3 <[email protected]>
Authored: Tue Aug 1 13:46:30 2017 -0700
Committer: Nakul Jindal <[email protected]>
Committed: Tue Aug 1 13:46:30 2017 -0700

----------------------------------------------------------------------
 python-performance-test.md | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/systemml/blob/83b9a221/python-performance-test.md
----------------------------------------------------------------------
diff --git a/python-performance-test.md b/python-performance-test.md
index c265bc6..02d3e34 100644
--- a/python-performance-test.md
+++ b/python-performance-test.md
@@ -11,10 +11,13 @@ Our performance test suite contains `7` families, namely `binomial`, `multinomial
 At a very high level, we construct a string with the arguments required to run each operation. Once this string is constructed, we use the subprocess module to execute it and extract the time from the standard output.
 
-We also use `json` module write our configurations to a json file. This ensure that our current operation is easy to debug.
+We also use the `json` module to write our configurations to a json file. This ensures that our operations are easy to debug.
+We have `7` files in the performance test suite:
 
-We have `5` files in performance test suit `run_perftest.py`, `datagen.py`, `train.py`, `predict.py` and `utils.py`.
+- Entry File: `run_perftest.py`
+- Supporting Files: `datagen.py`, `train.py`, `predict.py`
+- Utility Files: `utils_exec.py`, `utils_fs.py`, `utils_misc.py`
 
 `datagen.py`, `train.py` and `predict.py` each generate a dictionary. The key is the name of the algorithm being processed and the value is a list with the path(s) where all the required data is present. We define this dictionary as a configuration packet.
 
@@ -28,7 +31,7 @@ In the `train.py` script we have the functions required to generate training output. We
 
 The file `predict.py` contains the functions for all algorithms in the performance test that have a predict script. We return the required configuration packet as the result of this script, with the algorithm to run as the key and the locations of the predict json files as the values.
 
-In the file `utils.py` we have all the helper functions required in our performance test. These functions do operations like write `json` files, extract time from std out etc.
+In the file(s) `utils_*.py` we have all the helper functions required by our performance test. These functions perform operations like writing `json` files, extracting time from standard out, etc.
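A minimal sketch of the execution pattern described above, using Python's standard `subprocess`, `re` and `json` modules; the function names and the timing-line format here are illustrative assumptions, not the suite's actual API:

```python
import json
import re
import subprocess

def exec_and_parse_time(cmd_string):
    # Execute the constructed command string and capture all of its output.
    proc = subprocess.Popen(cmd_string, shell=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    # Assumed output format: a line such as "Total execution time: 6.956 sec.".
    match = re.search(r'execution time:\s*([0-9.]+)', out.decode('utf-8'))
    return float(match.group(1)) if match else float('nan')

def write_config_json(path, config_packet):
    # Persist the configuration packet so a failing run is easy to debug.
    with open(path, 'w') as handle:
        json.dump(config_packet, handle, indent=4)
```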
 
 ### Adding New Algorithms
 While adding a new algorithm we need to know whether it has to be part of any pre-existing family. If this algorithm depends on a new data generation script, we would need to create a new family. The steps to add a new algorithm are below.
@@ -75,7 +78,7 @@ Default settings for our performance test below:
 - Matrix size of 10,000 rows and 100 columns.
 - Execution mode `singlenode`.
 - Operation modes `data-gen`, `train` and `predict` in sequence.
-- Matrix type set to `all`. Which will generate `dense` or / and `sparse` matrices for all relevant algorithms.
+- Matrix type set to `all`, which will generate `dense` and/or `sparse` matrices for all relevant algorithms.
 
 ### Examples
 Some examples of the SystemML performance test with arguments are shown below:
@@ -104,6 +107,9 @@ Run performance test for the algorithms `m-svm` with `multinomial` family. Run o
 `
 Run performance test for all algorithms under the family `regression2` and log with filename `new_log`.
 
+`./scripts/perftest/python/run_perftest.py --family binomial clustering multinomial regression1 regression2 stats1 stats2 --config-dir /Users/krishna/open-source/systemml/scripts/perftest/temp3 --temp-dir hdfs://localhost:9000/temp3`
+Run performance test for all algorithms using HDFS.
+
 ### Operational Notes
 All performance tests depend mainly on two scripts for execution, `systemml-standalone.py` and `systemml-spark-submit.py`. In case we need to change standalone or spark parameters, we need to manually change these parameters in their respective scripts.
@@ -117,13 +123,26 @@
 multinomial|data-gen|0|dense|10k_100| 0.33
 MultiLogReg|train|0|10k_100|dense|6.956
 MultiLogReg|predict|0|10k_100|dense|4.780
 
-These logs can be found in `temp` folder (`$SYSTEMML_HOME/scripts/perftest/temp`) in-case not overridden by `--temp-dir`. This `temp` folders also contain the data generated during our performance test.
+These logs and config `json` files can be found in the `temp` folder (`$SYSTEMML_HOME/scripts/perftest/temp`) when not overridden by `--config-dir`.
+
+`--temp-dir` by default points to the local file system. We can change it to point to an HDFS path, e.g. `--temp-dir hdfs://localhost:9000/temp`, where all files generated during execution will be saved.
+
+Every time a script executes in `data-gen` mode successfully, we write a `_SUCCESS` file. If this file exists, we ensure that the same script is not re-run. Support for configuration options like `-stats`, `-explain`, `--conf` has also been added.
+
+Results obtained by our performance tests can be automatically uploaded to Google Docs.
+
+`./update.py --file ../temp/singlenode.out --exec-mode singlenode --auth client_json.json --tag 1.0`
+
+In the example above, `--tag` can be a major/minor SystemML version and `--auth` points to the `json` key required by Google Docs.
+
+Currently we only support time differences between algorithm runs in different versions. These can be obtained by running the script below:
+`./stats.py --auth client_json.json --exec-mode singlenode --tags 1.0 2.0`
 
-Every time a script executes in `data-gen` mode successfully, we write a `_SUCCESS` file. If this file exists we ensures that re-run of the same script is not possible as data already exists.
+Note: Please pip install the `gspread` package (`https://github.com/burnash/gspread`) to use the Google Docs client.
 
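As a rough illustration of what the Google Docs upload involves, the sketch below authorizes the `gspread` client with a service-account `json` key and appends one result row to a sheet. The sheet name and row layout are assumptions for illustration; `update.py` and `stats.py` are the scripts the suite actually ships:

```python
import gspread
from oauth2client.service_account import ServiceAccountCredentials

# Authorize with the json key passed via --auth; the scope list is the
# standard one for Google Sheets access via gspread.
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('client_json.json', scope)
client = gspread.authorize(creds)

# Sheet name and row layout are illustrative assumptions, not the
# suite's actual schema.
sheet = client.open('systemml_perf_results').sheet1
# algorithm, operation, matrix shape, matrix type, time (s), version tag
sheet.append_row(['MultiLogReg', 'train', '10k_100', 'dense', 6.956, '1.0'])
```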
 ### Troubleshooting
 We can debug the performance test by making changes in the following locations, depending on the issue:
-- Please see `utils.py` function `exec_dml_and_parse_time`. In uncommenting the debug print statement in the function `exec_dml_and_parse_time`. This allows us to inspect the subprocess string being executed.
+- Please see `utils_exec.py` function `subprocess_exec`.
 - Please see `run_perftest.py`. Changing the verbosity level to `0` allows us to log more information while the script runs (see the sketch after this list).
-- Eyeballing the json files generated and making sure the arguments are correct.
+- Eyeballing the json files generated and making sure the configuration arguments are correct.
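A minimal sketch of the verbosity idea from the second bullet, using Python's standard `logging` module; the level mapping here is an assumption for illustration, not necessarily the scheme `run_perftest.py` uses:

```python
import logging

# Assumed mapping: verbosity 0 is the most talkative, matching the note
# above that level 0 logs more information while the script runs.
VERBOSITY_TO_LEVEL = {0: logging.DEBUG, 1: logging.INFO, 2: logging.WARNING}

def init_logging(verbosity):
    logging.basicConfig(level=VERBOSITY_TO_LEVEL.get(verbosity, logging.INFO),
                        format='%(asctime)s %(levelname)s %(message)s')

init_logging(0)
logging.debug('subprocess command strings become visible at this level')
```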
