Baunsgaard commented on pull request #1481:
URL: https://github.com/apache/systemds/pull/1481#issuecomment-1002201875


   
   Hi @DavidSandruTU !
   
   Thanks for the contribution :1st_place_medal:
   I will take it over from here, address the remaining points, and merge it shortly.
   To cover your answers and questions:
   
   > The following tests were performed on an old Intel i5-6500 processor.
   > 
   > Most algorithms execute in a similar time of a couple of seconds, but here are some outliers:
   > 
   > **StratifiedStatistics on temp/stratstats/A_10k/data: 39.069242381**
   
   This sounds reasonable.
   
   > PCA is especially heavy-hitting it seems, but in line with the algorithm, I think(?):
   > 
   > **PCA on temp/dimensionreduction/pcaData5k_2k_dense: 285.998737102**
   
   Yes, PCA performance degrades once the input has many columns. It has two bottlenecks: first the TSMM, which gets worse with many columns (and potentially many rows), and then the eigendecomposition, which also scales badly with many columns.
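   For intuition, a rough cost sketch (standard asymptotic costs for an n x m dense input, not measured numbers):
   
   ```
   TSMM:   Z = t(X) %*% X            ~ O(n * m^2)   (quadratic in columns)
   eigen:  eigendecomposition of Z   ~ O(m^3)       (cubic in columns)
   ```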
   
   > ALS generally seems to be somewhat heavy-hitting (~20-30 seconds), but a more interesting outlier is the direct-solve algorithm on sparse data. In theory, with optimizations, it should perform better than the same algorithm on dense data.
   > 
   > **ALS-DS algorithm on temp/als/X10k_1k_sparse: 132.269487990**
   
   Great observation. Sometimes the sparse optimizations are not beneficial, for various reasons, but I will look into this one since the difference is 4-7x.
   
   > > 1. Make all scripts executable from the same directory starting point.
   > >    I suggest 'scripts/perftest', and write an error if you are not in a directory named that.
   > 
   > Alright, here I want to make sure I've understood everything correctly. If 
I remember right, a single perftest folder, with all necessary logic, is copied 
onto the cluster and executed. I think this is why Matthias changed the data 
generation scripts, in a previous commit, to execute the DMLs in 
"./datagen/...", instead of "../datagen/...".
   > 
   > What should I do here? Should I assume the datagen scripts are in the 
perftest folder or outside of it? Want me to add a copy of the datagen scripts 
to the perftest folder? Also, I'd then make sure the scripts are in the current 
directory 'perftest', as I think there would be no 'scripts' parent directory. 
Am I right in that assumption?
   
   The folder structure should be assumed to be the same as in the repository; the scripts should not be copied around in different layouts.
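   A minimal sketch of the directory guard from point 1 (the exact message is illustrative, not the final wording; it only checks the directory name, per the discussion above):
   
   ```bash
   #!/bin/bash
   # Guard: the perftest scripts assume the repository layout, so relative paths
   # like ./datagen/... only work when launched from scripts/perftest.
   if [ "$(basename "$PWD")" != "perftest" ]; then
     echo "Error: please run this script from the scripts/perftest directory." >&2
     exit 1
   fi
   ```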
   
   > > 2. Suggestion: make all the code for scaling experiments available in all cases, but add an argument that specifies the max memory. For instance, 'genClusteringData.sh 80' indicating 80MB max, and make this consistent across scripts.
   > 
   > :white_check_mark: Please check my code and let me know if you like my approach, e.g. genBinomialData.sh and runAllBinomial.sh.
   
   Great!
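   For reference, a minimal sketch of the max-memory gating from point 2 (the dataset sizes and echo bodies are placeholders for the actual generation calls):
   
   ```bash
   #!/bin/bash
   # Usage: genClusteringData.sh [MAXMEM_MB], e.g. 'genClusteringData.sh 80' for an 80MB cap.
   MAXMEM=${1:-80}   # memory budget in MB, defaulting to the smallest scale
   
   if [ "$MAXMEM" -ge 80 ]; then
     echo "generating 80MB clustering data"    # placeholder for the datagen call
   fi
   if [ "$MAXMEM" -ge 800 ]; then
     echo "generating 800MB clustering data"   # skipped when the budget is below 800MB
   fi
   ```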
   
   > > 3. Go through your changes and change all the &>> to &> and >> to >. I am a bit confused about the '&' part, since I don't know what it does in the context of '>'. If you can explain it, that would be nice; otherwise remove the '&' from the redirection to files. (This comment is unrelated to adding '&' at the end of lines, since I know that parallelizes the instruction.)
   > 
   > :white_check_mark: I think the "&" is so that error output is also put into the file, not only the standard out. I kept it that way from the original scripts, but never confirmed it myself. Most outputs now overwrite instead of appending, except for some special cases:
   > 
   >     * The federated algorithm often accesses a file multiple times, e.g. 
once when creating and once when killing workers. Here I append instead of 
overwrite.
   > 
   >     * Similar with results/times.txt, where all algorithms append their 
performance times.
   
   Good!
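   To confirm the '&' question: in bash, '&>' and '&>>' redirect both stdout and stderr, while '>' and '>>' redirect stdout only. A quick demonstration (listing an existing and a missing file produces output on both streams):
   
   ```bash
   #!/bin/bash
   ls . missing_file >  out.log    # stdout only, overwrite; the error still hits the terminal
   ls . missing_file >> out.log    # stdout only, append
   ls . missing_file &> out.log    # stdout AND stderr, overwrite (same as '> out.log 2>&1')
   ls . missing_file &>> out.log   # stdout AND stderr, append
   ```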
   
   > > 5. Make some of the smallest data generations run in parallel. This is simple to add with a '&' at the end of the line in sh, plus a 'wait' at the end of the scripts.
   > 
   > :white_check_mark: Some scripts have commands that split data into test and training data. These depend on the previous data generation command. I've added multiple waits with PIDs for that; let me know if you're happy with it. E.g. genBinomialData.sh
   
   This is great!
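   A minimal sketch of that pattern (the function bodies are placeholders for the actual systemds datagen and split calls):
   
   ```bash
   #!/bin/bash
   gen_data()   { sleep 1; echo "generated $1"; }      # placeholder datagen
   split_data() { echo "split $1 into train/test"; }   # placeholder train/test split
   
   gen_data A & pidA=$!   # independent generations run in parallel
   gen_data B & pidB=$!
   
   wait "$pidA"           # the split of A depends only on generation A
   split_data A &
   
   wait "$pidB"
   split_data B &
   
   wait                   # block until all remaining background jobs finish
   ```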
   
   > > 6. Run all... should run all; don't comment anything out, but add an argument to stop execution based on the max memory argument.
   > 
   > Do you want me to uncomment the matrix multiply and transform benchmarks too, or was that just meant in the context of the max memory argument?
   
   Yes, I want to assume that everything is run. Since the intention is to get performance benchmarks and find regressions, running everything to get numbers is the default action, and therefore what the scripts should reflect.
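   A sketch of how runAll.sh could reflect that (the script names besides runAllBinomial.sh are illustrative; the point is that nothing is commented out and the budget decides what runs):
   
   ```bash
   #!/bin/bash
   MAXMEM=${1:-80}   # max-memory budget in MB, forwarded to every suite
   
   ./runAllBinomial.sh   "$MAXMEM"
   ./runAllClustering.sh "$MAXMEM"   # illustrative name
   ./runMatrixMult.sh    "$MAXMEM"   # matrix multiply stays in, gated by MAXMEM
   ./runAllTransform.sh  "$MAXMEM"   # transform benchmark stays in as well
   ```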
   
   > I've moved/copied some bash logic relating to the federated tests into the 
fed folder, as I've written new performance tests for als-predict and als-DS. 
These new perftests were originally not part of the federated tests, which is 
why I added the "original" versions to the fed folder, to be executed by the 
fed tests alone.
   
   Good!
   
   > I've made a file "env-variables", which is sourced once in runAll.sh. The problem here is that if the federated test is run with that config (from runAll.sh), each worker then has that heap size (e.g. 10GB), which crashes the test in my case. I've added that "export" to the file, commented out.
   
   This is okay. We should probably modify the federated performance suite to use different memory budgets for the workers and the controller anyway, but that is out of scope for this commit.
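   For context, a sketch of what such an env-variables file might look like (the variable names and sizes are illustrative placeholders, not necessarily what SystemDS or this PR actually uses):
   
   ```bash
   # env-variables: sourced once from runAll.sh
   export SYSTEMDS_ROOT="$(cd ../.. && pwd)"
   export PATH="$SYSTEMDS_ROOT/bin:$PATH"
   
   # 10GB heap for the driver JVM (illustrative variable name).
   export SYSDS_JVM_OPTS="-Xmx10g -Xms10g"
   
   # Left commented out on purpose: if every federated worker inherited the
   # driver's 10GB heap, starting several workers on one machine would crash.
   # export SYSDS_FED_WORKER_JVM_OPTS="-Xmx2g"
   ```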
   
   > > > ```
   > > >   * It does not stop when run with logistic regression data.
   > > > ```
   > > 
   > > 
   > > Hmm, this might be a bug. If you want to, please report it as such in JIRA with a way to reproduce it:
   > > 
https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12335852
   > 
   > Will do.
   
   Thanks for the issue.
   
   

