> On 8 March 2012 at 11:06, Justin Talbot wrote:
> | I've been working on an R performance academic project for the last
> | couple years which has involved writing an interpreter for R from
> | scratch and a JIT for R vector operations.
>
> Cool. I think John mentioned that once or twice and I promptly forgot.
>
> Can you share some numbers?
Sure, I'll give a quick summary. We're writing a paper on it right now
which will have more details.

We currently execute scalar R code (non-vectorized) through an
interpreter we wrote from scratch. We haven't put a whole lot of time
into it; it supports most of the important R semantics, but does not
yet implement most of the functions in src/names.c, which limits the
scope of real-world code we can run. On a set of microbenchmarks
(careful what you conclude from microbenchmarks!) it runs about 4-5x
faster than Luke's bytecode interpreter. The interpreter is still
about 3-4x slower than LuaJIT's interpreter, probably the fastest
dynamic-language interpreter out there, so there is room for further
improvement, but not a lot. (Lua is a much cleaner language from the
performance standpoint and LuaJIT's interpreter is written in
assembly. We don't anticipate doing that anytime soon.)

We execute vectorized code through a JIT that generates SSE code and
can parallelize across multiple cores. Performance here depends
greatly on the vector size and the number of vectors, since our
performance gain primarily comes from eliminating memory accesses.
For long vectors (1M+ elements) we've seen gains of about 5x-50x on a
single core for plausible workloads. We don't have good numbers on
the parallelization yet, but we have seen linear scalability out to
32 cores for a couple of our workloads. Scalability is clearly very
task-dependent and we don't expect to get large numbers across the
board.

One implication of having a JIT is that we now implement a lot of
functionality at the R level rather than in C functions. For example,
we implement matrix-vector multiplication as:

# accumulate m %*% v one scaled column at a time
r <- 0
for(i in 1L:ncol(m)) {
    r <- r + m[,i]*v[[i]]
}

This beats R's built-in matrix-vector multiplication by a factor of 2
for "large" matrices (at least one dimension larger than 1000 or so)
and will parallelize without any more work from the programmer. With
more work to squeeze out our JIT overheads this could be effective
even for much smaller matrices.

> | So why is R performance poor now? I think the fundamental reason is
> | related to software engineering: R is nearly impossible to experiment
> | with, so no one tries out new performance techniques on it. There are
>
> Did you compare notes with the CXXR project by Andrew Runnalls and his
> student(s)? See http://www.cs.kent.ac.uk/projects/cxxr/

I haven't talked to them, but I should! Looking at their slides it
looks like their approach will be effective at making the R core more
extensible, but it's somewhat antagonistic to pure interpreter
performance. I didn't see any performance numbers, but I would guess
that CXXR runs somewhat slower than the current interpreter. The key
to being able to experiment with performance is for the core code to
be small and well defined, not necessarily extensible.

> | I see little value in debating changes to the language semantics until
> | we've addressed this low-hanging fruit and at least tried to make the
> | current R/S semantics run fast.
>
> Fully agree.
>
> I'd add that helping expand R via the FFI also works, though it is of
> course not as easy on the end user as making the core faster.

FFI is extremely important and Rcpp is a great step forward. I'll
just note that FFI and performance interact. An FFI like
.Call/.External exposes too many of R's internal implementation
details to users, making it difficult to improve performance in the
core while maintaining backwards compatibility.
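To make that concrete, here's a sketch of the idiom I mean (my_sum is
made up for illustration, not code from our project or any real
package):

#include <R.h>
#include <Rinternals.h>

/* A typical .Call routine: grab a raw pointer to the vector's
   storage and walk it directly.  (REAL here; NUMERIC_POINTER from
   Rdefines.h is the same idea.)  Note how much this bakes in: x must
   be a numeric vector stored as a flat, contiguous array of doubles
   that stays addressable for the duration of the call. */
SEXP my_sum(SEXP x)
{
    int i, n = length(x);
    double *px = REAL(x);   /* raw pointer into R's own storage */
    double s = 0.0;
    for (i = 0; i < n; i++)
        s += px[i];
    return ScalarReal(s);
}

Once compiled into a shared library this gets invoked from R as
.Call("my_sum", x), and from that point on the package's correctness
depends on R's internal vector layout.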
It would be much better if R's high-performance FFI were something
like Rcpp itself, hiding almost all implementation details from the
user.

Just one example on FFIs: .Call/.External lets users get raw pointers
to vector data (e.g. NUMERIC_POINTER). This is fine and dandy as long
as all implementations store vectors contiguously in memory. But some
implementations may not want to. For example, Clojure gets
high-performance updates to its immutable arrays by storing them in a
tree data structure instead of flat in memory. This would be a nice
technique to port to R, but it breaks .Call packages.

A better FFI choice would have used something like NUMERIC_ELEMENT(x,i)
to hide the details of how element i is looked up in vector x. This
would have been just as fast for current packages while leaving a
forward path for more performance improvements.
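Sketching what that could look like (NUMERIC_ELEMENT is hypothetical;
nothing of the sort exists in R's API, and the macro body below is
just my illustration):

#include <R.h>
#include <Rinternals.h>

/* Hypothetical accessor.  In today's R it could simply expand to the
   flat load, so current packages would pay nothing for it: */
#define NUMERIC_ELEMENT(x, i) (REAL(x)[i])

/* An implementation that stored numeric vectors in chunks, the way
   Clojure stores its persistent vectors, could instead make this a
   tree lookup; packages written against NUMERIC_ELEMENT would keep
   working after a recompile, with no source changes. */

SEXP my_sum(SEXP x)   /* the same illustrative routine as above */
{
    int i, n = length(x);
    double s = 0.0;
    for (i = 0; i < n; i++)
        s += NUMERIC_ELEMENT(x, i);  /* storage layout stays hidden */
    return ScalarReal(s);
}

Because the accessor compiles down to the same load that REAL(x)[i]
produces now, current packages lose nothing; the gain is that the
core gets back the freedom to change the representation later.

Justin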