jonkeane commented on pull request #10269: URL: https://github.com/apache/arrow/pull/10269#issuecomment-841238315
A couple of comments/additions, though I think you're generally right.

The R benchmarks tend to be stable (https://conbench.ursa.dev/compare/runs/8b6fef07829948998502a7677dec6e03...0cbd9dcbe2594e06ab95cf0e088cf25b/ is a run on the master branch and shows between -3% and 1% change; that -3% is an outlier, and the next largest decrease is -0.8%). So we can have decent confidence that we're not observing noise alone here. We're working actively to improve this, but I wanted to put it out there as part of the assumptions I'm using.

There are some file-read benchmarks that are >5% slower. Interestingly, it is all (and only) the fanniemae dataset that is slower (both reading from parquet and from feather), and *only* when it is being converted to a data.frame, not when it is being left as a table. This seems a little suspect to me, since the only places where I see you've meaningfully changed the code are `RecordBatch$create`, `Table$create`, and `MakeArrayFromScalar`. Do any of those get called when reading parquet or feather files?

Note: I don't see csv reads run here. IIRC those were proactively disabled due to memory issues, but I will confirm that (I thought this machine should have been able to handle these, and there is https://issues.apache.org/jira/browse/ARROW-12519 to track).

There are also a number of other benchmarks that are in the 1-5% slower range (the other file-read benchmarks, as well as the df to R conversions and a handful of the writing benchmarks). The df to R conversions seem more in line with the code that was changed, and those are in the 3-6% range (though most are closer to 3%, with one being an outlier at 6%).

The next 28/138 or ~20% of the benchmarks are 0-1% slower, and then 19/138 or ~14% of the benchmarks are 0-1% faster. These are probably all just noise.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org