Thanks, this is what I currently do :) However, I'd like to find a solution that is both memory efficient (X can be very large) and which does not modify X in place.
Basically, I'm wondering whether there was a BLAS subroutine that would allow to compute cross(X, w, Y) in one pass without creating an intermediate matrix as large as X or Y.
