[GitHub] [systemds] kykrueger commented on pull request #1847: [SYSTEMDS-2834] python IO benchmarking

via GitHub Sun, 18 Jun 2023 12:52:07 -0700


kykrueger commented on PR #1847:
URL: https://github.com/apache/systemds/pull/1847#issuecomment-1596249203


   @Baunsgaard 
   
   I have a few questions about requirements now that I've gotten started with 
a proof-of-concept script.
   
   1. I've selected the `timeit` module to help make the times more 
reproducible, since it lets us cleanly exclude setup from the recorded time and 
disable some sources of additional processing time, like Python's garbage 
collector. However, I wasn't sure if this is something you want. Are you on 
board with this approach, or did you want more of a big-picture benchmark with 
all overhead included? 
   2. I've noticed that numpy and pandas data is transferred lazily to 
SystemDS in the JVM, so calling `from_numpy` or `from_pandas` alone doesn't 
let me measure the real overhead. If you choose the big-picture benchmark 
above, I figure I'd be stuck adding some simple operator and calling 
`compute` or `get_lineage` to force the data to load. Another option would be 
to trick SystemDS into loading the data by creating an empty `DMLScript` and 
calling the hidden internal `__prepare_script()` method, because it has a bit 
less overhead than `get_lineage()`. I'd prefer this second option since it is 
more isolated. Does it meet your needs?
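
   To make both points concrete, here is a minimal sketch of the timing 
approach using only the standard library. `LazyMatrix`, `make_data`, and 
`bench` are hypothetical stand-ins written for illustration; they are not part 
of the SystemDS Python API. The sketch assumes a handle that defers its work 
until `compute()` is called, which is roughly the behaviour described in 
point 2.

```python
import timeit


class LazyMatrix:
    """Hypothetical lazy handle; NOT the SystemDS API."""

    def __init__(self, data):
        # Construction only stores a reference; nothing is copied yet.
        self._data = data
        self.transferred = False

    def compute(self):
        # Stand-in for the point at which the data would actually move.
        self.transferred = True
        return [row[:] for row in self._data]


def make_data(rows=200, cols=50):
    """Build a small nested list to act as the input data set."""
    return [[float(r * cols + c) for c in range(cols)] for r in range(rows)]


def bench(stmt, repeat=5, number=10):
    """Return the best per-call time in seconds for `stmt`.

    The setup statement runs once per repeat and is excluded from the
    measurement. timeit also disables the garbage collector while timing
    by default.
    """
    namespace = {"LazyMatrix": LazyMatrix, "make_data": make_data}
    timer = timeit.Timer(stmt=stmt, setup="data = make_data()", globals=namespace)
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Timing only the constructor measures almost nothing for a lazy handle.
lazy_only = bench("LazyMatrix(data)")
# Forcing evaluation inside the timed statement captures the real work.
forced = bench("LazyMatrix(data).compute()")

print(f"construct only: {lazy_only:.3e} s per call")
print(f"construct + compute: {forced:.3e} s per call")
```

   Taking the minimum over the repeats follows the recommendation in the 
`timeit` documentation, since the higher values mostly reflect interference 
from other processes rather than the code under test.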
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
