Baunsgaard commented on pull request #1481:
URL: https://github.com/apache/systemds/pull/1481#issuecomment-1002201875
Hi @DavidSandruTU! Thanks for the contribution :1st_place_medal:. I will take it over from here, address the remaining points, and merge it shortly. To cover your answers and questions:

> The following tests were performed on an old Intel i5-6500 processor.
>
> Most algorithms execute in a similar time of a couple of seconds, but here are some outliers:
>
> **StratifiedStatistics on temp/stratstats/A_10k/data: 39.069242381**

This sounds reasonable.

> PCA is especially heavy hitting it seems, but in line with the algorithm I think(?):
>
> **PCA on temp/dimensionreduction/pcaData5k_2k_dense: 285.998737102**

Yes, PCA performance degrades once the input has many columns. It has two bottlenecks: first TSMM, which gets worse with many columns (and potentially many rows), and then eigen decomposition, which is also slow with many columns.

> ALS seems to generally be somewhat heavy hitting (~20-30 seconds), but a more interesting outlier is the direct-solve algorithm on sparse data. In theory, with optimizations, it should perform better than the same algorithm on dense data.
>
> **ALS-DS algorithm on temp/als/X10k_1k_sparse: 132.269487990**

Great observation. Sometimes the sparse optimizations are not beneficial, for various reasons, but I will look into this one since the difference is 4-7x.

> > 1. Make all scripts executable from the same directory starting point. I suggest 'scripts/perftest', and write an error if you are not in a directory named that.
>
> Alright, here I want to make sure I've understood everything correctly. If I remember right, a single perftest folder, with all necessary logic, is copied onto the cluster and executed. I think this is why Matthias changed the data generation scripts, in a previous commit, to execute the DMLs in "./datagen/...", instead of "../datagen/...".
>
> What should I do here? Should I assume the datagen scripts are in the perftest folder or outside of it? Want me to add a copy of the datagen scripts to the perftest folder?
> Also, I'd then make sure the scripts are in the current directory 'perftest', as I think there would be no 'scripts' parent directory. Am I right in that assumption?

The structure of the folders should be assumed to be the same as in the repository, not copied in different ways.

> > 2. Suggestion: make all the code available for scaling experiments in all cases, but add an argument that specifies max memory. For instance, 'genClusteringData.sh 80' indicating 80MB max, and make this consistent across scripts.
>
> :white_check_mark: Please check my code and let me know if you like my approach, e.g. genBinomialData.sh and runAllBinomial.sh.

Great!

> > 3. Go through your changes and change all the &>> to &> and >> to >. I am a bit confused about the '&' part of it, since I don't know what it does in the context of '>'. If you can explain it, that would be nice; otherwise remove the '&' from the piping to files. (This comment is unrelated to adding '&' at the end of lines, since I know that parallelizes the instruction.)
>
> :white_check_mark: I think the "&" is so that error output is also put into the file, not only the standard out. I kept it that way from the original scripts, but never confirmed it myself. Most outputs are changed to now overwrite, not append, except for some special cases:
>
> * The federated algorithm often accesses a file multiple times, e.g. once when creating and once when killing workers. Here I append instead of overwrite.
> * Similarly with results/times.txt, where all algorithms append their performance times.

Good!

> > 5. Make some of the smallest data generation run in parallel. This is simple to add with a & at the end of the line in sh, and then a 'wait' at the end of the scripts.
>
> :white_check_mark: Some scripts have commands that split data into test and training data. These depend on the previous data generation command. I've added multiple waits with PIDs for that; let me know if you're happy with that. E.g.
> genBinomialData.sh

This is great!

> > 6. Run all... should run all; don't comment anything out, but add an argument to stop execution based on the max memory argument.
>
> Do you want me to uncomment the matrix multiply and transform benchmarks too, or was that just meant in the context of the max memory argument?

Yes. I want to assume that everything is run: since the intention is to get performance benchmarks to find regressions, running everything to get numbers is the default action, and therefore what the scripts should reflect.

> I've moved/copied some bash logic relating to the federated tests into the fed folder, as I've written new performance tests for als-predict and als-DS. These new perftests were originally not part of the federated tests, which is why I added the "original" versions to the fed folder, to be executed by the fed tests alone.

Good!

> I've made a file "env-variables", which is sourced once in runAll.sh. The problem here is, if the federated test is run with that config (from runAll.sh), then each worker gets that heap size (e.g. 10GB), which crashes the test in my case. I've added that "export" commented out to the file.

This is okay. We probably should modify the federated performance suite to use different memory budgets for the workers and the controller anyway, but that is out of scope for this commit.

> > > It does not stop when run with logistic regression data.
> >
> > Hmm, this might be a bug. Please, if you want to, report it as such in JIRA with a way to reproduce it: https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12335852
>
> Will do.

Thanks for the issue.
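To close out the open question on the '&' in the redirections above: in bash, `&>` and `&>>` redirect both stdout *and* stderr to the file (truncating and appending, respectively), while plain `>` and `>>` redirect stdout only, so the understanding in the reply is correct. A minimal sketch (file names are illustrative, not from the perftest scripts):

```shell
#!/usr/bin/env bash
# emit prints one line to stdout and one line to stderr.
emit() {
  echo "out line"          # stdout
  echo "err line" >&2      # stderr
}

emit >  out.log 2>/dev/null   # '>'  : stdout only, truncates the file
emit >> out.log 2>/dev/null   # '>>' : stdout only, appends
emit &>  both.log             # '&>' : stdout AND stderr, truncates
emit &>> both.log             # '&>>': stdout AND stderr, appends

wc -l < out.log    # prints 2 (one stdout line per run; stderr was discarded)
wc -l < both.log   # prints 4 (stdout + stderr from each of the two '&' runs)
```

Note that `&>` and `&>>` are bash extensions; the portable POSIX `sh` spelling is `cmd > file 2>&1` (or `>> file 2>&1` to append).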