[GitHub] [systemds] kykrueger commented on pull request #1847: [SYSTEMDS-2834] python IO benchmarking

via GitHub Sun, 25 Jun 2023 03:59:26 -0700


kykrueger commented on PR #1847:
URL: https://github.com/apache/systemds/pull/1847#issuecomment-1606026498


   @Baunsgaard thanks for the recommendations. The current approach should meet 
the requirements for measuring the loading time for all of the options in 
Python. It uses an option I noticed after writing my last comment, and I 
believe it keeps all relevant compilation steps while avoiding any extra 
operations after loading.
   
   It sticks with timeit instead of switching to the time functions, which I 
think still adds a benefit without much added complexity. At first glance, 
timeit benchmark scripts may look like an antipattern, since the code to be 
measured is written in strings, but this seems to be the Pythonic way.
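
   As a rough illustration of that pattern (a minimal sketch, not the code in 
this PR), the statement and its setup are passed to timeit as strings, and 
only the statement is timed. The helper name `time_statement` and the 
CSV-parsing statement are placeholders chosen for the example; they stand in 
for whatever loading call is actually being benchmarked.

```python
import timeit


def time_statement(stmt, setup="pass", repeat=3, number=10):
    """Return the best observed time per execution of `stmt`, in seconds.

    `setup` is executed before each repetition and is not included
    in the measured time.
    """
    timings = timeit.repeat(stmt=stmt, setup=setup, repeat=repeat, number=number)
    # The minimum is the least noisy estimate; divide by `number`
    # to get the time for a single execution.
    return min(timings) / number


# Hypothetical workload: parse an in-memory CSV with the standard library.
setup = "import csv, io; text = 'a,b,c\\n' * 1000"
stmt = "rows = list(csv.reader(io.StringIO(text)))"

best = time_statement(stmt, setup=setup)
print(f"best time per load: {best:.6f} s")
```

   timeit also accepts callables in place of strings, so the string form is a 
convention rather than a requirement.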
   
   There are a few things I see left for me to do:
   
   - [ ] figure out why the pandas benchmark is hanging
   - [ ] make the log/stdout match the style used in the other perftest scripts
   - [ ] add an input option for the binary data format instead of csv
   - [ ] wrap a DML script in bash which allows the user to choose the number 
of runs and pass a csv/binary file
   - [ ] choose a variety of generated input data
   - [ ] call all of the benchmark scripts in one bash script similar to the 
other perftest benchmarks.
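
   For the Python side, the run-count and format options could be exposed 
roughly as follows. This is a sketch under assumptions: the argument names 
(`path`, `--format`, `--runs`) and the default values are hypothetical and 
are not taken from the existing perftest scripts.

```python
import argparse


def build_parser():
    """Build a command-line parser for a hypothetical IO benchmark script."""
    parser = argparse.ArgumentParser(description="Python IO benchmark (sketch)")
    # Path to the input file that the benchmark should load.
    parser.add_argument("path", help="input file to load")
    # Data format of the input file; csv stays the default.
    parser.add_argument("--format", choices=["csv", "binary"], default="csv")
    # Number of timed runs, chosen by the user.
    parser.add_argument("--runs", type=int, default=5)
    return parser


# Example invocation with an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["data.bin", "--format", "binary", "--runs", "3"])
print(args.path, args.format, args.runs)
```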
   
   That leaves me with a few more things to check with you before I complete 
the above tasks.
   - Do the timeit benchmarks I've created so far cover all of the use-cases 
you had in mind?
   - You asked for a variety of data types when we started the ticket. I'd 
have expected that to mean floats vs ints vs strings vs doubles and so on. 
However, I've noticed that the SystemDS docs describe Frames, Matrices and 
Scalars as the primary data types. So I assume you meant only Frames and 
Matrices as the significant data types to be tested?
   - Initially I'd planned on writing some of my own data-generation scripts, 
but I noticed the existing generic one 
[scripts/utils/generateData.dml](https://github.com/apache/systemds/blob/main/scripts/utils/generateData.dml)
 and didn't see much value in creating a similar one. However, I am also 
considering adding a few extra runs using some of the generation scripts in 
scripts/perftest/datagen or scripts/datagen, but I'm not sure whether that is 
within the scope of the initial ticket. What do you think: is using the 
existing datagen what you were expecting?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@systemds.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
