jonkeane commented on pull request #10269:
URL: https://github.com/apache/arrow/pull/10269#issuecomment-841238315


   A couple of comments/additions, though I think you're generally right.
   
   The R benchmarks tend to be stable 
(https://conbench.ursa.dev/compare/runs/8b6fef07829948998502a7677dec6e03...0cbd9dcbe2594e06ab95cf0e088cf25b/
 is a run on the master branch that shows between -3% and 1% change, and that 
-3% is an outlier; the next largest decrease is -0.8%). So we can have decent 
confidence that we're not observing noise alone here. We're actively working to 
improve this stability, but I wanted to put it out there as one of the 
assumptions I'm using.
   
   There are some file-read benchmarks that are >5% slower. Interestingly, it 
is all (and only) the fanniemae dataset that is slower (both reading from 
parquet and from feather), and *only* when it is being converted to a 
data.frame, not when it is being left as a table. This seems a little suspect 
to me, since the only places where I see you've meaningfully changed the code 
are `RecordBatch$create`, `Table$create`, and `MakeArrayFromScalar`. Do any of 
those get called when reading parquet or feather files? 
   
   Note: I don't see csv reads run here. IIRC those were proactively disabled 
due to memory issues, but I will confirm that (I thought this machine should 
have been able to handle them, and there is 
https://issues.apache.org/jira/browse/ARROW-12519 to track it).
   
   There are also a number of other benchmarks that are in the 1-5% slower 
range (the other file-read benchmarks, as well as the df to R conversions and a 
handful of the writing benchmarks). The df to R conversions seem more in line 
with the code that was changed, and those are in the 3-6% range (though most 
are closer to 3%, with one being an outlier at 6%).
   
   The next 28/138 or ~20% of the benchmarks are 0-1% slower, and then 19/138 
or ~14% of the benchmarks are 0-1% faster. These are probably all just noise.
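
   For anyone who wants to redo this bucketing on another comparison run, here 
is a minimal Python sketch of the grouping used above. The function names, the 
1% and 5% thresholds, and the sign convention (negative = slower) are 
assumptions taken from the ranges in this comment, not anything conbench 
provides. The last two lines only re-check the quoted shares, assuming 138 
benchmarks in total (the denominator that matches both quoted percentages).

```python
from collections import Counter


def bucket(pct_change, noise=1.0, large=5.0):
    """Classify a percent change in speed (assumed: negative means slower)."""
    if abs(pct_change) <= noise:
        return "noise"  # within +/-1%, treated as run-to-run variation
    if pct_change < 0:
        return "much slower" if pct_change < -large else "slower"
    return "faster"


def summarize(changes):
    """Return {bucket: (count, rounded percent of all benchmarks)}."""
    counts = Counter(bucket(c) for c in changes)
    total = len(changes)
    return {name: (n, round(100 * n / total)) for name, n in counts.items()}


# Shares quoted in the comment, assuming 138 benchmarks in total.
print(round(100 * 28 / 138))  # 20
print(round(100 * 19 / 138))  # 14
```

   Passing the per-benchmark percent changes from a compare page to 
`summarize` would give the same kind of counts as quoted here.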
   
   
   


