kykrueger commented on PR #1847: URL: https://github.com/apache/systemds/pull/1847#issuecomment-1606909971
@Baunsgaard

> Well, not really. The data when read from CSV is always treated as worst case datatypes. this means in practice when we transfer the data over the scheme says string string string etc and therefore most likely is using the worst transfer.

Since I generated the data with SystemDS, it comes with a schema. When it comes to the interpolation, I meant to say that I didn't specify dtypes for numpy and pandas, but they were smart enough to recognize the values as float64.

To get benchmarks for the other data types, I could optionally pass a dtype to my scripts, which forces numpy and pandas to read the values into the type we want. I'll also check out overriding the types provided in the MTD files in the read command for SystemDS. I think I remember seeing something in your docs that makes that possible. This should make it easy to test all of the data types you listed without creating additional scripts. How about it?

- [ ] double, float, long, int, uint8, boolean and BitSets.

I'm not sure why you would like to generate the data in the scripts; that would leave us writing the same generation code multiple times. I'd prefer to keep using a generation format similar to the existing perftest scripts. However, I don't think the large data sizes are very useful. Since the 80MB dataset stalls pandas for much too long, I'm going to add a smaller generation option to the list and set it as the default.

> Also if you have any numbers of the things already, please just post them even if they are bad!

As for the run times, these values came out of a run with 1 repeat on a small dataset of 100 columns and 10k rows, if I remember correctly.

```
RUN IO Benchmarks: Sun 25 Jun 2023 10:19:23 PM CEST
-- Running IO benchmarks on custom
read.dml: 2.932409849
load_native.py: 3.3357212970149703
load_numpy.py: 2.91045009897789
load_pandas.py: 33.540340821957216
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@systemds.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
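
For reference, the optional dtype argument proposed in the comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names `load_numpy` and `load_pandas` and the in-memory `CSV_TEXT` sample are hypothetical and are not taken from the PR's benchmark scripts. It relies only on the `dtype` parameters of `numpy.loadtxt` and `pandas.read_csv`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data standing in for a generated benchmark file.
CSV_TEXT = "1,2,3\n4,5,6\n"


def load_numpy(source, dtype=None):
    # Hypothetical helper: np.loadtxt parses values as float64 by default,
    # so fall back to that when no dtype is requested.
    return np.loadtxt(source, delimiter=",",
                      dtype=np.float64 if dtype is None else dtype)


def load_pandas(source, dtype=None):
    # Hypothetical helper: with dtype=None pandas infers a type per column;
    # otherwise every column is read as the requested dtype.
    return pd.read_csv(source, header=None, dtype=dtype)


if __name__ == "__main__":
    # Read the same data once per requested type and report what was loaded.
    for name in ("float64", "float32", "int64", "int32", "uint8"):
        arr = load_numpy(io.StringIO(CSV_TEXT), dtype=name)
        df = load_pandas(io.StringIO(CSV_TEXT), dtype=name)
        print(name, arr.dtype, sorted(str(t) for t in set(df.dtypes)))
```

In a benchmark script, the dtype name would presumably arrive as an optional command-line argument, so one script could cover every numeric type in the list above. boolean and BitSets are left out here because they have no direct counterpart in this sketch.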