zhengruifeng edited a comment on pull request #30468: URL: https://github.com/apache/spark/pull/30468#issuecomment-732111713
It looks like:

1. `GEMM` is only about 7% slower than master, and I guess it can be further accelerated with a native BLAS impl. But it needs a big buffer (**m*n**), which I think is somewhat dangerous. Maybe we can split a block (whose size is optimized for `crossJoin`) into sub-blocks (whose size is optimized for `gemm`) to reduce this buffer, but I think that would be too convoluted.
2. Compared with the `DOT`-based impls, `GEMV` should be a nice choice. It is much faster (even with `f2jBLAS`), and the buffer size is relatively small (**n**).
3. [Guava.Ordering](https://github.com/google/guava/blob/master/guava/src/com/google/common/collect/Ordering.java#L723) is much faster than `BoundedPriorityQueue`. With Guava.Ordering, we do not need to create `Tuple2` objects.

The above tests were done locally, since I do not have a clean cluster for now. Only `f2jBLAS` was used, since after upgrading to Ubuntu 20.04 I have not yet been able to link netlib-java to native impls.

Friendly ping @srowen @MLnick @mpjlu @jkbradley @mengxr @WeichenXu123, because of your comments in previous PRs (https://github.com/apache/spark/pull/17742, https://github.com/apache/spark/pull/17845, https://github.com/apache/spark/pull/18624).
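To make the buffer-size and allocation points concrete, here is a minimal Java sketch, not the code in this PR. It assumes item factors stored row-major in a flat `float[]`, uses a plain loop as a stand-in for a BLAS `gemv` call, and uses a small sorted index array as a stand-in for the Guava-based selection. The class and method names (`GemvTopKSketch`, `gemv`, `topK`) are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch only: names and data layout are assumptions, not Spark's API.
final class GemvTopKSketch {

    // y = A * x, where A is n x rank in row-major order (one item factor per row).
    // The output buffer y has length n, versus m * n for a GEMM over m users.
    static void gemv(float[] a, int n, int rank, float[] x, float[] y) {
        for (int i = 0; i < n; i++) {
            float sum = 0f;
            int offset = i * rank;
            for (int j = 0; j < rank; j++) {
                sum += a[offset + j] * x[j];
            }
            y[i] = sum;
        }
    }

    // Returns the indices of the k largest scores, ordered by descending score.
    // Only primitive indices are stored, so no (id, score) tuple objects are created.
    static int[] topK(float[] scores, int k) {
        int size = Math.min(k, scores.length);
        int[] best = new int[size];
        int filled = 0;
        for (int i = 0; i < scores.length; i++) {
            if (filled < size) {
                filled++;                       // still room: claim a new slot
            } else if (size == 0 || scores[i] <= scores[best[size - 1]]) {
                continue;                       // not better than the current worst
            }
            // Shift smaller entries right and insert i at its sorted position.
            int p = filled - 1;
            while (p > 0 && scores[best[p - 1]] < scores[i]) {
                best[p] = best[p - 1];
                p--;
            }
            best[p] = i;
        }
        return best;
    }

    public static void main(String[] args) {
        float[] itemFactors = {1f, 0f, 0f, 1f, 1f, 1f}; // 3 items, rank 2
        float[] userFactor = {2f, 3f};
        float[] scores = new float[3];                  // buffer of size n
        gemv(itemFactors, 3, 2, userFactor, scores);
        System.out.println(Arrays.toString(topK(scores, 2)));
    }
}
```

The insertion-based selection costs O(n*k) and is only meant to show the allocation-free idea; a heap or Guava's `Ordering` would be the more realistic choice for large k.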
