[GitHub] [systemds] kykrueger commented on pull request #1847: [SYSTEMDS-2834] python IO benchmarking

via GitHub Mon, 26 Jun 2023 00:55:23 -0700


kykrueger commented on PR #1847:
URL: https://github.com/apache/systemds/pull/1847#issuecomment-1606909971


   @Baunsgaard 
   > Well, not really. The data when read from CSV is always treated as worst 
case datatypes.
   this means in practice when we transfer the data over the scheme says string 
string string etc and therefore most likely is using the worst transfer.
   
   Since I generated the data with systemds, it comes with a schema. As for the 
interpolation, I meant to say that I didn't specify dtypes for numpy and 
pandas, but they were smart enough to recognize the values as float64. 
   
   To get benchmarks for the other data types, I could pass an optional dtype 
to my scripts, which forces numpy and pandas to read the values into the type 
we want. I'll also check out overriding the types provided in the MTD files in 
the read command for systemds. I think I remember seeing something in your 
docs which makes that possible. This should make it easy to test all of the 
data types you listed without creating additional scripts. How about it?
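
The optional dtype could look roughly like the sketch below. This is not the code from the PR: the `--dtype` flag and the helper names `parse_args`, `load_numpy` and `load_pandas` are assumptions for illustration. It relies only on the `dtype` parameters of `numpy.loadtxt` and `pandas.read_csv`.

```python
import argparse
import io

import numpy as np
import pandas as pd


def parse_args(argv=None):
    # Hypothetical CLI: --dtype is optional, None means "let the library infer".
    parser = argparse.ArgumentParser(description="IO benchmark loader (sketch)")
    parser.add_argument("--dtype", default=None,
                        help="target type name, e.g. float32, int64, uint8")
    return parser.parse_args(argv)


def load_numpy(source, dtype=None):
    # np.loadtxt reads float64 by default; an explicit dtype overrides that.
    return np.loadtxt(source, delimiter=",",
                      dtype=np.dtype(dtype) if dtype else np.float64)


def load_pandas(source, dtype=None):
    # pd.read_csv infers column types when dtype is None.
    return pd.read_csv(source, header=None, dtype=dtype)
```

With this shape, one script per library covers every type in the list, for example `load_numpy(path, "float32")` or `load_pandas(path, "int64")`.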
   
   - [ ] double, float, long, int, uint8, boolean and BitSets. 
   
   I'm not sure why you would like to generate the data in the scripts; that 
would leave us writing the same generation code multiple times. I'd prefer to 
keep a generation format similar to the existing perftest scripts. However, I 
don't think the large data sizes are very useful. Since the 80MB dataset 
stalls pandas for much too long, I'm going to add a smaller generation option 
to the list and set it as the default. 
   
   > Also if you have any numbers of the things already, please just post them 
even if they are bad!
   
   As for the run-times, these values came from a run with 1 repeat on a small 
dataset of 100 columns and 10k rows, if I remember correctly.
   
   ```
   RUN IO Benchmarks:  Sun 25 Jun 2023 10:19:23 PM CEST
   -- Running IO benchmarks on custom
   read.dml:            2.932409849
   load_native.py:      3.3357212970149703
   load_numpy.py:       2.91045009897789
   load_pandas.py:      33.540340821957216
   ```
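
For context, a measurement with a configurable number of repeats can be sketched as follows. This is an illustration only, not the harness used for the numbers above; `time_load` is a hypothetical helper name.

```python
import time


def time_load(load_fn, repeats=1):
    """Call load_fn() `repeats` times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        load_fn()
        durations.append(time.perf_counter() - start)
    return durations
```

Reporting the minimum or median over several repeats would reduce the influence of one-off effects such as cold file caches.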


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
