On Thu, Jun 14, 2012 at 2:36 PM, Dirk Eddelbuettel <[email protected]> wrote:
>
> On 15 June 2012 at 02:56, c s wrote:
> | Simply installing ATLAS (which provides speed-ups for several Lapack
> | functions) on Debian/Ubuntu systems can already make a big difference.
> | (Debian & Ubuntu use a trick to redirect Lapack and Blas calls to
> | ATLAS). Under Mac OS X, the Accelerate framework provides fast
> | implementations of Lapack and Blas functions (eg. using
> | multi-threading).
>
> I found OpenBLAS to be faster than Atlas (with both coming from the
> distribution), and tend to install that now. It uses a multithreaded
> approach by default, but I have to double-check one thing about cpu
> affinity which, when used from R, has been seen to interfere. That is
> a possible reason for the single-threaded performance here. I'll
> report back.
>
> | I've taken the modular approach to Armadillo (ie. using Lapack rather
> | than reimplementing decompositions), as it specifically allows other
> | specialist parties (such as Intel) to provide Lapack that is highly
> | optimised for particular architectures. I myself would not be able to
> | keep up with the specific optimisations required for each CPU. This
> | also "future-proofs" Armadillo for each new CPU generation.
> |
> | More importantly, numerically stable implementation of computational
> | decompositions/factorisations is notoriously difficult to get right.
> | The core algorithms in Lapack have been evolving for the past 20+
> | years, being exposed to a bazillion corner-cases. Lapack itself is
> | related to Linpack and Eispack, which are even older. I've been
> | exposed to software development long enough to know that in the end
> | only time can shake out all the bugs. As such, using Lapack is far
> | less risky than reimplementing decompositions from scratch. A
> | "home-made" matrix decomposition might be a bit faster on a particular
> | CPU, but you have far less knowledge as to when it's going to blow up
> | in your face.
> |
> | High-performance variants of Lapack, such as MKL, take an existing
> | proven implementation of a decomposition algorithm and recode parts of
> | it in assembly, and/or parallelise other parts.
>
> All excellent points which nobody disputes. It just so happens that
> Eigen still looks better in some benchmarks but, as you say, we have to
> ensure we really compare apples to apples.
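To make the OpenBLAS threading point above concrete: a minimal sketch of pinning the thread count from C++ before any BLAS/Lapack work is dispatched, e.g. when the host process (such as R) manages its own parallelism and the library's cpu affinity handling would otherwise interfere. This assumes the installed OpenBLAS build exports the openblas_set_num_threads extension; setting the OPENBLAS_NUM_THREADS environment variable is a code-free alternative.

    // Sketch: force single-threaded OpenBLAS from the calling program.
    // Assumes the OpenBLAS build exports this extension function.
    extern "C" void openblas_set_num_threads(int num_threads);

    int main() {
        openblas_set_num_threads(1);  // disable OpenBLAS threading
        // ... BLAS/Lapack-backed computation goes here ...
        return 0;
    }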
While I realize that Conrad has strong opinions on these issues and is
unlikely to be moved from those opinions, I would make a few points in
favor of Eigen.

 - Using BLAS/Lapack is effective because that is mature, well-tested
   code. It is ineffective because the reference implementations of
   BLAS/Lapack are in Fortran 77 and suffer from all the idiosyncrasies
   of that ancient language. These include call-by-reference semantics,
   even for scalars, lack of structures beyond pointers to arrays, and
   the inability to allocate storage. Hence, the caller frequently must
   pass an array 'work' and its length 'lwork' and, occasionally,
   'rwork' and 'iwork'. There is a helpful convention for getting the
   length of the work array correct, which involves a preliminary call
   with lwork = -1 that does nothing but write the desired work array
   length into the first element of the work array. The caller must
   then allocate the work array, change the value of lwork, and call
   the routine a second time (a sketch of this two-call pattern appears
   after this message). Because the only structure available is a
   pointer to a one-dimensional array, you must pass a general matrix
   as four arguments: M, N, A, LDA. There are no machine-readable
   header files. Jeff Bezanson recently joked about writing a parser
   for the upper-case Fortran comments to create headers. Does any of
   this sound like 21st-century software design?

 - MKL BLAS is very good on Intel hardware. Naturally, Intel doesn't go
   to great lengths to support AMD processors to the same extent. MKL
   is proprietary, closed-source software, and plugging it into
   open-source software systems like R and Armadillo feels like a
   compromise. There are open alternatives like ATLAS and OpenBLAS.
   However, these all need to conform to the Fortran calling sequence
   specifications (Lapacke and CBLAS do provide C calling sequences,
   but it is not a good idea to count on their availability at
   present). And notice that this approach contradicts the "Lapack code
   is complex but widely tested so it should be left untouched"
   argument. The way to get good performance in OpenBLAS and MKL is to
   rewrite the BLAS and heavily-used parts of Lapack, like dgetrf, in a
   combination of C and assembler. I would also point out that one of
   the reasons the Lapack code is complex and difficult to understand
   is that it is written in the archaic language Fortran 77.

 - The approach in Eigen was to use templated C++ classes for dense and
   sparse matrices and for the decompositions and solvers. I don't
   think readers of this group need to be reminded of the advantages of
   templated object-oriented C++ code. Eigen has a bit of a learning
   curve, but it also has a great range of capabilities (a short
   example also appears after this message).

These comments may provoke a heated response from Conrad but, if so, I
don't plan to respond further. Eigen and Armadillo are different
approaches, each with their own advantages and disadvantages.
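To make the workspace-query convention described above concrete, here is a minimal C++ sketch driving Lapack's dgeqrf (QR factorisation) through the Fortran 77 interface. The trailing-underscore symbol name is an assumption about the usual Fortran name mangling; link against the system Lapack.

    #include <vector>
    #include <algorithm>

    // Fortran 77 prototype: everything passed by pointer, and the
    // matrix travels as the four arguments M, N, A, LDA.
    extern "C" void dgeqrf_(const int* m, const int* n, double* a,
                            const int* lda, double* tau, double* work,
                            const int* lwork, int* info);

    int main() {
        int m = 4, n = 3, lda = m, info = 0;
        std::vector<double> A(lda * n, 1.0);      // column-major storage
        std::vector<double> tau(std::min(m, n));

        // Call 1: lwork = -1 is the workspace query; dgeqrf writes the
        // optimal workspace length into work[0] and does nothing else.
        double wkopt = 0.0;
        int lwork = -1;
        dgeqrf_(&m, &n, A.data(), &lda, tau.data(), &wkopt, &lwork, &info);

        // Call 2: allocate the work array and factorise for real.
        lwork = static_cast<int>(wkopt);
        std::vector<double> work(lwork);
        dgeqrf_(&m, &n, A.data(), &lda, tau.data(), work.data(), &lwork,
                &info);
        return info;  // 0 on success
    }

Two calls, eight pointer arguments, and manual workspace management, for a single factorisation.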
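For contrast, a sketch of the equivalent factor-and-solve with Eigen's templated classes, where the matrix type owns its dimensions and storage and the decomposition manages its own workspace:

    #include <Eigen/Dense>

    int main() {
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(4, 4);
        Eigen::VectorXd b = Eigen::VectorXd::Random(4);

        // Householder QR factor-and-solve; the other decompositions
        // (LU, LDLT, SVD, ...) expose the same one-line interface.
        Eigen::VectorXd x = A.householderQr().solve(b);
        return x.size() == 4 ? 0 : 1;
    }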
