Repository: systemml
Updated Branches:
  refs/heads/gh-pages 5c3c2f27d -> bf0245c69
[SYSTEMML-1451] phase 2 work

Completed these tasks as part of Phase 2 for Google Summer of Code '17
- Decouple systemml-spark-submit.py
- Decouple systemml-standalone.py
- Refactor perf test suite to accept args like debug, stats, config etc...
- Add HDFS support
- Google Docs support
- Compare SystemML with previous versions
- Pylint, Comment
- Extra arguments configuration Test
- Windows Test
- Doc update
- systemml standalone comments
- systemml spark submit comments

Closes #575

Project: http://git-wip-us.apache.org/repos/asf/systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/systemml/commit/83b9a221
Tree: http://git-wip-us.apache.org/repos/asf/systemml/tree/83b9a221
Diff: http://git-wip-us.apache.org/repos/asf/systemml/diff/83b9a221

Branch: refs/heads/gh-pages
Commit: 83b9a221e5bee8a0232d5a470702944aab056fb4
Parents: 5c3c2f2
Author: krishnakalyan3 <[email protected]>
Authored: Tue Aug 1 13:46:30 2017 -0700
Committer: Nakul Jindal <[email protected]>
Committed: Tue Aug 1 13:46:30 2017 -0700

----------------------------------------------------------------------
 python-performance-test.md | 35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/systemml/blob/83b9a221/python-performance-test.md
----------------------------------------------------------------------
diff --git a/python-performance-test.md b/python-performance-test.md
index c265bc6..02d3e34 100644
--- a/python-performance-test.md
+++ b/python-performance-test.md
@@ -11,10 +11,13 @@ Our performance test suite contains `7` families, namely `binomial`, `multinomial
 At a very high level, we construct a string with the arguments required to run each operation. Once this string is constructed, we use the subprocess module to execute it and extract the time from the standard output.
 
-We also use `json` module write our configurations to a json file. This ensure that our current operation is easy to debug.
+We also use the `json` module to write our configurations to a json file. This ensures that our operations are easy to debug.
+We have `7` files in the performance test suite:
 
-We have `5` files in performance test suit `run_perftest.py`, `datagen.py`, `train.py`, `predict.py` and `utils.py`.
+- Entry File: `run_perftest.py`
+- Supporting Files: `datagen.py`, `train.py`, `predict.py`
+- Utility Files: `utils_exec.py`, `utils_fs.py`, `utils_misc.py`
 
 `datagen.py`, `train.py` and `predict.py` each generate a dictionary. The key is the name of the algorithm being processed and the value is a list with the path(s) where all the required data is present. We define this dictionary as a configuration packet.
 
@@ -28,7 +31,7 @@ In the `train.py` script we have the functions required to generate training output. We
 
 The file `predict.py` contains the functions for all algorithms in the performance test that have a predict script. We return the required configuration packet as the result of this script, with the algorithm to run as the key and the locations of the predict json files as the values.
 
-In the file `utils.py` we have all the helper functions required in our performance test. These functions do operations like write `json` files, extract time from std out etc.
+In the file(s) `utils_*.py` we have all the helper functions required by our performance test. These functions perform operations like writing `json` files, extracting time from standard out, etc.
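A minimal sketch of the execution pattern described above, using Python's standard `subprocess`, `re` and `json` modules; the function names and the timing-line format here are illustrative assumptions, not the suite's actual API:

```python
import json
import re
import subprocess

def exec_and_parse_time(cmd_string):
    # Execute the constructed command string and capture all of its output.
    proc = subprocess.Popen(cmd_string, shell=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    # Assumed output format: a line such as "Total execution time: 6.956 sec.".
    match = re.search(r'execution time:\s*([0-9.]+)', out.decode('utf-8'))
    return float(match.group(1)) if match else float('nan')

def write_config_json(path, config_packet):
    # Persist the configuration packet so a failing run is easy to debug.
    with open(path, 'w') as handle:
        json.dump(config_packet, handle, indent=4)
```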
 
 ### Adding New Algorithms
 While adding a new algorithm we need to know whether it has to be part of any pre-existing family. If this algorithm depends on a new data generation script, we would need to create a new family. The steps to add a new algorithm are below.
@@ -75,7 +78,7 @@ Default settings for our performance test below:
 - Matrix size of 10,000 rows and 100 columns.
 - Execution mode `singlenode`.
 - Operation modes `data-gen`, `train` and `predict` in sequence.
-- Matrix type set to `all`. Which will generate `dense` or / and `sparse` matrices for all relevant algorithms.
+- Matrix type set to `all`, which will generate `dense` and/or `sparse` matrices for all relevant algorithms.
 
 ### Examples
 Some examples of the SystemML performance test with arguments are shown below:
@@ -104,6 +107,9 @@ Run performance test for the algorithms `m-svm` with `multinomial` family. Run o
 `
 Run performance test for all algorithms under the family `regression2` and log with filename `new_log`.
 
+`./scripts/perftest/python/run_perftest.py --family binomial clustering multinomial regression1 regression2 stats1 stats2 --config-dir /Users/krishna/open-source/systemml/scripts/perftest/temp3 --temp-dir hdfs://localhost:9000/temp3`
+Run performance test for all algorithms using HDFS.
+
 ### Operational Notes
 All performance tests depend mainly on two scripts for execution, `systemml-standalone.py` and `systemml-spark-submit.py`. In case we need to change standalone or spark parameters, we need to manually change these parameters in their respective scripts.
@@ -117,13 +123,26 @@
 multinomial|data-gen|0|dense|10k_100| 0.33
 MultiLogReg|train|0|10k_100|dense|6.956
 MultiLogReg|predict|0|10k_100|dense|4.780
 
-These logs can be found in `temp` folder (`$SYSTEMML_HOME/scripts/perftest/temp`) in-case not overridden by `--temp-dir`. This `temp` folders also contain the data generated during our performance test.
+These logs and config `json` files can be found in the `temp` folder (`$SYSTEMML_HOME/scripts/perftest/temp`) when not overridden by `--config-dir`.
+
+`--temp-dir` by default points to the local file system. We can change it to point to an HDFS path, e.g. `--temp-dir hdfs://localhost:9000/temp`, where all files generated during execution will be saved.
+
+Every time a script executes in `data-gen` mode successfully, we write a `_SUCCESS` file. If this file exists, we ensure that the same script is not re-run. Support for configuration options like `-stats`, `-explain`, `--conf` has also been added.
+
+Results obtained by our performance tests can be automatically uploaded to Google Docs.
+
+`./update.py --file ../temp/singlenode.out --exec-mode singlenode --auth client_json.json --tag 1.0`
+
+In the example above, `--tag` can be a major/minor SystemML version and `--auth` points to the `json` key required by Google Docs.
+
+Currently we only support time differences between algorithm runs in different versions. These can be obtained by running the script below:
+`./stats.py --auth client_json.json --exec-mode singlenode --tags 1.0 2.0`
 
-Every time a script executes in `data-gen` mode successfully, we write a `_SUCCESS` file. If this file exists we ensures that re-run of the same script is not possible as data already exists.
+Note: Please pip install the `gspread` package (`https://github.com/burnash/gspread`) to use the Google Docs client.
 
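As a rough illustration of what the Google Docs upload involves, the sketch below authorizes the `gspread` client with a service-account `json` key and appends one result row to a sheet. The sheet name and row layout are assumptions for illustration; `update.py` and `stats.py` are the scripts the suite actually ships:

```python
import gspread
from oauth2client.service_account import ServiceAccountCredentials

# Authorize with the json key passed via --auth; the scope list is the
# standard one for Google Sheets access via gspread.
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('client_json.json', scope)
client = gspread.authorize(creds)

# Sheet name and row layout are illustrative assumptions, not the
# suite's actual schema.
sheet = client.open('systemml_perf_results').sheet1
# algorithm, operation, matrix shape, matrix type, time (s), version tag
sheet.append_row(['MultiLogReg', 'train', '10k_100', 'dense', 6.956, '1.0'])
```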
 ### Troubleshooting
 We can debug the performance test by making changes in the following locations, depending on the issue:
-- Please see `utils.py` function `exec_dml_and_parse_time`. In uncommenting the debug print statement in the function `exec_dml_and_parse_time`. This allows us to inspect the subprocess string being executed.
+- Please see `utils_exec.py` function `subprocess_exec`.
 - Please see `run_perftest.py`. Changing the verbosity level to `0` allows us to log more information while the script runs (see the sketch after this list).
-- Eyeballing the json files generated and making sure the arguments are correct.
+- Eyeballing the json files generated and making sure the configuration arguments are correct.
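A minimal sketch of the verbosity idea from the second bullet, using Python's standard `logging` module; the level mapping here is an assumption for illustration, not necessarily the scheme `run_perftest.py` uses:

```python
import logging

# Assumed mapping: verbosity 0 is the most talkative, matching the note
# above that level 0 logs more information while the script runs.
VERBOSITY_TO_LEVEL = {0: logging.DEBUG, 1: logging.INFO, 2: logging.WARNING}

def init_logging(verbosity):
    logging.basicConfig(level=VERBOSITY_TO_LEVEL.get(verbosity, logging.INFO),
                        format='%(asctime)s %(levelname)s %(message)s')

init_logging(0)
logging.debug('subprocess command strings become visible at this level')
```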
