[GitHub] [arrow] jonkeane commented on pull request #10269: ARROW-11705: [R] Support scalar value recycling in RecordBatch/Table$create()

GitBox Fri, 14 May 2021 06:18:18 -0700


jonkeane commented on pull request #10269:
URL: https://github.com/apache/arrow/pull/10269#issuecomment-841238315

A couple of comments/additions, though I think you're generally right.

The R benchmarks tend to be stable
(https://conbench.ursa.dev/compare/runs/8b6fef07829948998502a7677dec6e03...0cbd9dcbe2594e06ab95cf0e088cf25b/
is a run on the master branch and shows between -3% and 1% change, where the
-3% is an outlier; the next largest decrease is -0.8%). So we can have decent
confidence that we're not observing noise alone here. We're working actively to
improve this, but I wanted to put it out there as part of the assumptions I'm
using.

There are some file-read benchmarks that are >5% slower. Interestingly, it is
all (and only) the fanniemae dataset that is slower (both reading from parquet
and from feather), and *only* when it is being converted to a data.frame, not
when it is being left as a table. This seems a little suspect to me, since the
only places where I see you've meaningfully changed the code are
`RecordBatch$create`, `Table$create`, and `MakeArrayFromScalar`. Do any of
those get called when reading parquet or feather files?

Note: I don't see csv reads run here. IIRC those were proactively disabled
due to memory issues, but I will confirm that (I thought this machine
should have been able to handle these, and there is
https://issues.apache.org/jira/browse/ARROW-12519 to track that).

There are also a number of other benchmarks that are in the 1-5% slower
range (the other file reads, as well as the df to R conversions, and a handful
of the writing benchmarks). The df to R conversions seem more in line with the
code that was changed, and those are in the 3-6% range (though most are closer
to 3%, with one being an outlier at 6%).

The next 28/138 or ~20% of the benchmarks are 0-1% slower, and then 19/138 or
~14% of the benchmarks are 0-1% faster. These are probably all just noise.
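
For anyone who wants to check the shares quoted above, here is a minimal sketch of the arithmetic. It assumes both counts are out of 138 benchmarks, which is the total that matches the quoted ~20% and ~14%. The helper names and the 1% noise band are my own illustration, not anything conbench itself provides.

```python
# Hypothetical helpers for sanity-checking the benchmark shares quoted above.
# TOTAL_BENCHMARKS = 138 is an assumption: it is the total that reproduces
# both rounded percentages in the comment.
TOTAL_BENCHMARKS = 138
NOISE_BAND_PCT = 1.0  # assumed band: changes within +/-1% are treated as noise


def share_pct(count, total=TOTAL_BENCHMARKS):
    """Return count/total as a percentage rounded to the nearest integer."""
    return round(100 * count / total)


def within_noise(change_pct, band=NOISE_BAND_PCT):
    """True when a percent change falls inside the assumed noise band."""
    return abs(change_pct) <= band


if __name__ == "__main__":
    print("0-1% slower:", share_pct(28), "% of benchmarks")
    print("0-1% faster:", share_pct(19), "% of benchmarks")
    print("-0.8% is noise:", within_noise(-0.8))
    print("-5.0% is noise:", within_noise(-5.0))
```

Under that assumption, the two 0-1% groups together cover roughly a third of the benchmarks, and anything outside the band (such as the >5% file-read slowdowns) is what deserves a closer look.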
