DavidSandruTU edited a comment on pull request #1481:
URL: https://github.com/apache/systemds/pull/1481#issuecomment-1000563066


   > Hi, @DavidSandruTU !
   
   Thanks for the feedback, @Baunsgaard !
   
   > This is a good start, i have only minor things, but each take a bit of 
time. Just reach out if there is questions, also please try to run it once you 
have the limit argument just with the smallest size of each data, and post some 
of the numbers in the comments :), especially if you find some of them 
interesting or weird.
   
   The following tests were performed on an old Intel i5-6500 processor.
   
   Most algorithms execute in a similar time of a couple of seconds, but here are some outliers:
   
   **StratifiedStatistics on temp/stratstats/A_10k/data: 39.069242381**
   
   PCA seems especially heavy-hitting, but I think that is in line with the algorithm(?):
   
   **PCA on temp/dimensionreduction/pcaData5k_2k_dense: 285.998737102**
   
   ALS generally seems somewhat heavy-hitting (~20-30 seconds), but a more interesting outlier is the direct-solve algorithm on sparse data. In theory, with optimizations, it should perform better than the same algorithm on dense data.
   
   **ALS-DS algorithm on temp/als/X10k_1k_sparse: 132.269487990**
   
   > 1. Make all scripts executable from the same directory starting point.
   >    i suggest, 'scripts/perftest' and write an error if you are not in a 
directory named that.
   
   Alright, here I want to make sure I've understood everything correctly. If I 
remember right, a single perftest folder, with all necessary logic, is copied 
onto the cluster and executed. I think this is why Matthias changed the data 
generation scripts, in a previous commit, to execute the DMLs in 
"./datagen/...", instead of "../datagen/...". 
   
   What should I do here? Should I assume the datagen scripts are in the perftest folder or outside of it? Do you want me to add a copy of the datagen scripts to the perftest folder?
   Also, I'd then make sure the scripts are executed from the current directory 'perftest', as I think there would be no 'scripts' parent directory. Am I right in that assumption?
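
   For the directory check itself, this is a minimal sketch of the guard I have in mind, assuming the scripts should refuse to run outside a directory named 'perftest' (exact error wording is just a placeholder):

   ```bash
   #!/bin/bash
   # Guard at the top of each perftest script: abort unless the current
   # working directory is named "perftest".
   if [ "$(basename "$PWD")" != "perftest" ]; then
     echo "Error: please run this script from the 'perftest' directory." >&2
     exit 1
   fi
   ```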
   
   > 2. Suggestion, make all the code available for scaling experiments in all 
cases, but add an argument that specify max memory. For instance 
'genClusteringData.sh 80' indicating 80MB max, and make this consistent across 
scripts.
   
   ✅
   Please check my code and let me know if you like my approach, e.g. genBinomialData.sh and runAllBinomial.sh. A rough sketch of the idea is below.
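
   This is only an illustrative sketch, not the actual script; the sizes (80/800/8000 MB) and the echo lines are placeholders for the real datagen calls:

   ```bash
   #!/bin/bash
   # Optional max-memory argument in MB; defaults to the smallest size.
   MAXMEM=${1:-80}

   if [ "$MAXMEM" -ge 80 ]; then
     echo "generating ~80 MB dataset"     # placeholder for the real datagen call
   fi
   if [ "$MAXMEM" -ge 800 ]; then
     echo "generating ~800 MB dataset"
   fi
   if [ "$MAXMEM" -ge 8000 ]; then
     echo "generating ~8 GB dataset"
   fi
   ```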
   
   > 3. Go through your changes and change all the &>> to &> and >> to >, I am 
a bit confused about the '&' part of it, since i don't know what it does in 
context to '>'. If you can explain it it would be nice, otherwise remove the 
'&' from the piping to files. (this comment is unrelated to adding '&' at the 
end of the lines, since i know this parallelize the instruction)
   
   ✅
   I think the "&" is there so that error output (stderr) is also written to the file, not only standard out. I kept it that way from the original scripts, but never confirmed it myself.
   Most outputs are changed to now overwrite, not append, except for some special cases:
   - The federated algorithm often accesses a file multiple times, e.g. once when creating and once when killing workers. Here I append instead of overwriting.
   - Similarly for results/times.txt, where all algorithms append their performance times.
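
   For reference, this is the behavior of the two forms as I understand it (stderr is only captured with the '&' variant):

   ```bash
   # '>' redirects only stdout; error messages still go to the terminal.
   ls /nonexistent > out.log     # out.log stays empty, error printed to terminal
   # '&>' redirects both stdout and stderr into the file (truncating it).
   ls /nonexistent &> out.log    # the error message ends up in out.log
   # The append variants '>>' and '&>>' behave analogously.
   ```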
   
   > 4. Don't change permissions on log4j files and DML files.
   >    You have done so on:
   > 
   > * scripts/perftest/conf/log4j.properties
   > * scripts/perftest/fed/data/splitAndMakeFederated.dml
   > * scripts/perftest/scripts/GLM-predict.dml
   > * scripts/perftest/scripts/alsCG.dml
   
   ✅
   Today I learned: Git commits file permissions.
   
   > 5. Make some of the smallest data generation run in parallel, this is 
simple to add with a & on the end of the line in sh, and then add a 'wait' in 
the end of the scripts.
   
   ✅
   Some scripts have commands that split data into test and training sets, and these depend on the previous data generation command. I've added multiple waits on specific PIDs for that (see the sketch below); let me know if you're happy with it. E.g. genBinomialData.sh
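
   To illustrate the dependency handling, a simplified sketch where 'sleep' stands in for the actual datagen and split commands:

   ```bash
   #!/bin/bash
   sleep 2 &      # generate dataset A (placeholder for the real datagen call)
   pidA=$!
   sleep 3 &      # generate dataset B (placeholder), independent of A

   wait $pidA     # the train/test split of A only needs A to be finished
   sleep 1 &      # split dataset A into train/test (placeholder)

   wait           # wait for all remaining background jobs
   ```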
   
   > 6. Run all... should run all, don't comment anything out. but add an 
argument to stop execution based on the max Memory argument
   
   Do you want me to uncomment the matrix multiply and transform benchmarks too, or was that only meant in the context of the max memory argument?
   
   > 7. Make all the loops go through all datasets, subject to the memory 
constraints, example 'scripts/perftest/runAllMultinominal.sh'
   
   ✅
   This relates to points 2 and 6, and is implemented; roughly, the loops now look like the sketch below.
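
   Again just an illustrative sketch, with made-up dataset names and sizes:

   ```bash
   #!/bin/bash
   MAXMEM=${1:-80}

   # Pairs of "dataset approximate-size-in-MB" (illustrative values only).
   DATASETS=("X10k_1k_dense 80" "X100k_1k_dense 800" "X1M_1k_dense 8000")

   for entry in "${DATASETS[@]}"; do
     set -- $entry                        # split into name and size
     name=$1; size=$2
     if [ "$size" -gt "$MAXMEM" ]; then
       continue                           # skip datasets above the memory limit
     fi
     echo "running benchmark on $name"    # placeholder for the actual run
   done
   ```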
   
   > 8. If you make the commands multi line with '\' then make sure you indent the following lines. example in 'scripts/perftest/runKmeans.sh'
   
   ✅
   
   > 9. Remove commented out code, that obviously is not used and should not be 
used. example in 'scripts/perftest/runStratStats.sh'
   
   The calls for the old algorithms? I thought they might be helpful to compare 
the output with the new built-ins, but I'll remove them ✅
   
   > 10. Add empty newline in the end of files example 
'scripts/perftest/scripts/stratstats.dml'
   
   ✅
   
   
   > 
   > > ```
   > > * I'm assuming the federated tests are to be ran locally only, not on a 
cluster.
   > > ```
   > 
   > True, they are running with localhost workers, We want to add the 
federated things, but it require a bit of setup and remote machines so not 
obvious for a perf benchmark.
   
   I've moved/copied some bash logic relating to the federated tests into the 
fed folder, as I've written new performance tests for als-predict and als-DS. 
These new perftests were originally not part of the federated tests, which is 
why I added the "original" versions to the fed folder, to be executed by the 
fed tests alone.
   
   > > ```
   > > * The stratified statistics benchmark needs a heap size of at least 10 
GB. This setting is exported only when it's ran locally via "export 
SYSTEMDS_STANDALONE_OPTS="-Xmx10g -Xms10g -Xmn2000m"
   > > ```
   > Change this, so that it is set via a settings file, or other means.
   
   I've made a file "env-variables", which is sourced once in runAll.sh. The problem here is that if the federated test is run with that config (from runAll.sh), each worker then gets that heap size (e.g. 10GB), which crashes the test in my case. I've added that "export" to the file as a comment.
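
   Roughly, the file and its use look like this (a sketch; the commented-out export carries the values from the original script):

   ```bash
   # scripts/perftest/env-variables (sketch)
   # Heap size needed e.g. by the stratified statistics benchmark.
   # Left commented out by default, because the federated tests start several
   # localhost workers and each of them would inherit the full 10 GB heap.
   # export SYSTEMDS_STANDALONE_OPTS="-Xmx10g -Xms10g -Xmn2000m"
   ```

   and runAll.sh sources it once near the top:

   ```bash
   source ./env-variables
   ```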
   
   
   
   > > ```
   > >   * It does not stop when ran with logistic regression data.
   > > ```
   > 
   > hmm, this might be a bug, please if you want to, report it as such in JIRA 
with a way to reproduce it.
   > 
   > https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12335852
   
   Will do.
   
   
   
   Also, Merry Christmas! 🎅
   
   
   

