kykrueger commented on PR #1847:
URL: https://github.com/apache/systemds/pull/1847#issuecomment-1606026498

@Baunsgaard thanks for the recommendations. The current approach should meet the requirements for measuring the loading time for all of the options in Python. It uses an option I noticed after writing my last comment, and I believe it keeps all relevant compilation steps while avoiding any extra operations after loading. It sticks with timeit instead of switching to the time functions, which I think still adds a benefit without much added complexity. At first glance, timeit benchmark scripts may look like an antipattern, since the code to be measured is written as strings, but this seems to be the Pythonic way.

There are a few things I see left for me to do:

- [ ] figure out why the pandas benchmark is hanging
- [ ] make the log/stdout match the style used in the other perftest scripts
- [ ] add an input option for the binary data format instead of csv
- [ ] wrap a DML script in bash so that the user can choose the number of runs and pass a csv/binary file
- [ ] choose a variety of generated input data
- [ ] call all of the benchmark scripts in one bash script, similar to the other perftest benchmarks

That leaves me with a few more things to check with you before I complete the above tasks.

- Do the timeit benchmarks I've created so far cover all of the use-cases you had in mind?
- You asked for a variety of data-types when we started the ticket. I'd have expected that to mean floats vs ints vs strings vs doubles and so on. However, I've noticed that the SystemDS docs describe Frame, Matrix and Scalars as the primary data-types. So, I assume that you meant only Frames and Matrices as the significant data-types to be tested?
- Initially I'd planned on writing some of my own data-generation scripts, but I noticed the existing generic one, [scripts/utils/generateData.dml](https://github.com/apache/systemds/blob/main/scripts/utils/generateData.dml), and didn't see much value in creating a similar one.
  However, I am also considering adding a few extra runs using some of the generation scripts in scripts/perftest/datagen or scripts/datagen, but am not sure if that is in scope of the initial ticket. What do you think, is the use of the existing datagen what you were expecting?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@systemds.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org