[spark] branch branch-3.0 updated: [SPARK-32339][ML][DOC] Improve MLlib BLAS native acceleration docs

huaxingao Tue, 28 Jul 2020 08:44:15 -0700

This is an automated email from the ASF dual-hosted git repository.

huaxingao pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 8cfb718  [SPARK-32339][ML][DOC] Improve MLlib BLAS native acceleration 
docs
8cfb718 is described below

commit 8cfb7183865c5358a547ec892f10d4f1350300ff
Author: Xiaochang Wu <xiaochang...@intel.com>
AuthorDate: Tue Jul 28 08:36:11 2020 -0700

    [SPARK-32339][ML][DOC] Improve MLlib BLAS native acceleration docs
    
    ### What changes were proposed in this pull request?
    Rewrite the BLAS native acceleration enabling guide to be clearer and more complete.
    
    ### Why are the changes needed?
    The documentation on enabling BLAS native acceleration in the ML guide 
(https://spark.apache.org/docs/latest/ml-guide.html#dependencies) is incomplete 
and unclear to users.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    N/A
    
    Closes #29139 from xwu99/blas-doc.
    
    Lead-authored-by: Xiaochang Wu <xiaochang...@intel.com>
    Co-authored-by: Wu, Xiaochang <xiaochang...@intel.com>
    Signed-off-by: Huaxin Gao <huax...@us.ibm.com>
    (cherry picked from commit 44c868b73a7cb293ec81927c28991677bf33ea90)
    Signed-off-by: Huaxin Gao <huax...@us.ibm.com>
---
 docs/ml-guide.md        |  22 +++--------
 docs/ml-linalg-guide.md | 103 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+), 16 deletions(-)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index ddce98b..1b4a3e4 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -62,23 +62,13 @@ The primary Machine Learning API for Spark is now the 
[DataFrame](sql-programmin
 
 # Dependencies
 
-MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), 
which depends on
-[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical 
processing.
-If native libraries[^1] are not available at runtime, you will see a warning 
message and a pure JVM
-implementation will be used instead.
+MLlib uses linear algebra packages [Breeze](http://www.scalanlp.org/) and 
[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical 
processing[^1]. Those packages may call native acceleration libraries such as 
[Intel 
MKL](https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html)
 or [OpenBLAS](http://www.openblas.net) if they are available as system 
libraries or in runtime library paths. 
 
-Due to licensing issues with runtime proprietary binaries, we do not include 
`netlib-java`'s native
-proxies by default.
-To configure `netlib-java` / Breeze to use system optimised binaries, include
-`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as 
a dependency of your
-project and read the [netlib-java](https://github.com/fommil/netlib-java) 
documentation for your
-platform's additional installation instructions.
-
-The most popular native BLAS such as [Intel 
MKL](https://software.intel.com/en-us/mkl), 
[OpenBLAS](http://www.openblas.net), can use multiple threads in a single 
operation, which can conflict with Spark's execution model.
-
-Configuring these BLAS implementations to use a single thread for operations 
may actually improve performance (see 
[SPARK-21305](https://issues.apache.org/jira/browse/SPARK-21305)). It is 
usually optimal to match this to the number of cores each Spark task is 
configured to use, which is 1 by default and typically left at 1.
-
-Please refer to resources like the following to understand how to configure 
the number of threads these BLAS implementations use: [Intel 
MKL](https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications)
 or [Intel 
oneMKL](https://software.intel.com/en-us/onemkl-linux-developer-guide-improving-performance-with-threading)
 and [OpenBLAS](https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded). 
Note that if nativeBLAS is n [...]
+Due to differing OSS licenses, `netlib-java`'s native proxies can't be 
distributed with Spark. See [MLlib Linear Algebra Acceleration 
Guide](ml-linalg-guide.html) for how to enable accelerated linear algebra 
processing. If accelerated native libraries are not enabled, you will see a 
warning message like below and a pure JVM implementation will be used instead:
+```
+WARN BLAS: Failed to load implementation 
from:com.github.fommil.netlib.NativeSystemBLAS
+WARN BLAS: Failed to load implementation 
from:com.github.fommil.netlib.NativeRefBLAS
+```
 
 To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 
1.4 or newer.
 
diff --git a/docs/ml-linalg-guide.md b/docs/ml-linalg-guide.md
new file mode 100644
index 0000000..7390913
--- /dev/null
+++ b/docs/ml-linalg-guide.md
@@ -0,0 +1,103 @@
+---
+layout: global
+title: MLlib Linear Algebra Acceleration Guide
+displayTitle: MLlib Linear Algebra Acceleration Guide
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+     http://www.apache.org/licenses/LICENSE-2.0
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+## Introduction
+
+This guide provides necessary information to enable accelerated linear algebra 
processing for Spark MLlib.
+
+Spark MLlib defines Vector and Matrix as basic data types for machine learning 
algorithms. On top of them, 
[BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) and 
[LAPACK](https://en.wikipedia.org/wiki/LAPACK) operations are implemented and 
supported by [netlib-java](https://github.com/fommil/netlib-Java) (the 
algorithms may call [Breeze](https://github.com/scalanlp/breeze) and it will in 
turn call `netlib-java`). `netlib-java` can use optimized native linear algebra 
l [...]
+
+However, due to license differences, the officially released Spark binaries 
don't include native library support for `netlib-java` by default.
+
+The following sections describe how to enable `netlib-java` with native 
libraries support for Spark MLlib and how to install native libraries and 
configure them properly.
+
+## Enable `netlib-java` with native library proxies 
+
+`netlib-java` depends on `libgfortran`. It requires GFORTRAN 1.4 or above. 
This can be obtained by installing `libgfortran` package. After installation, 
the following command can be used to verify if it is installed properly.
+```
+strings /path/to/libgfortran.so.3.0.0 | grep GFORTRAN_1.4
+```
+
+To build Spark with `netlib-java` native library proxies, you need to add 
`-Pnetlib-lgpl` to the Maven build command line. For example:
+```
+$SPARK_SOURCE_HOME/build/mvn -Pnetlib-lgpl -DskipTests -Pyarn -Phadoop-2.7 
clean package
+```
+
+If you only want to enable it in your project, include 
`com.github.fommil.netlib:all:1.1.2` as a dependency of your project.
+
+## Install native linear algebra libraries
+
+Intel MKL and OpenBLAS are two popular native linear algebra libraries. You 
can choose one of them based on your preference. We provide basic instructions 
as below. You can refer to [netlib-java 
documentation](https://github.com/fommil/netlib-java) for more advanced 
installation instructions.
+
+### Intel MKL
+
+- Download and install Intel MKL. The installation should be done on all nodes 
of the cluster. We assume the installation location is $MKLROOT (e.g. 
/opt/intel/mkl).
+- Create soft links to `libmkl_rt.so` with specific names in system library 
search paths. For instance, make sure `/usr/local/lib` is in system library 
search paths and run the following commands:
+```
+$ ln -sf $MKLROOT/lib/intel64/libmkl_rt.so /usr/local/lib/libblas.so.3
+$ ln -sf $MKLROOT/lib/intel64/libmkl_rt.so /usr/local/lib/liblapack.so.3
+```
+
+### OpenBLAS
+
+The installation should be done on all nodes of the cluster. Generic versions 
of OpenBLAS are available with most distributions. You can install it with a 
distribution package manager like `apt` or `yum`.
+
+For Debian / Ubuntu:
+```
+sudo apt-get install libopenblas-base
+sudo update-alternatives --config libblas.so.3
+```
+For CentOS / RHEL:
+```
+sudo yum install openblas
+```
+
+## Check if native libraries are enabled for MLlib
+
+To verify native libraries are properly loaded, start `spark-shell` and run 
the following code:
+```
+scala> import com.github.fommil.netlib.BLAS;
+scala> System.out.println(BLAS.getInstance().getClass().getName());
+```
+
+If they are correctly loaded, it should print 
`com.github.fommil.netlib.NativeSystemBLAS`. Otherwise, warnings like the 
following are printed:
+```
+WARN BLAS: Failed to load implementation 
from:com.github.fommil.netlib.NativeSystemBLAS
+WARN BLAS: Failed to load implementation 
from:com.github.fommil.netlib.NativeRefBLAS
+```
+
+If native libraries are not properly configured in the system, the Java 
implementation (f2jBLAS) will be used as a fallback option.
+
+## Spark Configuration
+
+The default behavior of multi-threading in either Intel MKL or OpenBLAS may 
not be optimal with Spark's execution model [^1].
+
+Therefore, configuring these native libraries to use a single thread for 
operations may actually improve performance (see 
[SPARK-21305](https://issues.apache.org/jira/browse/SPARK-21305)). It is 
usually optimal to match this to the value of `spark.task.cpus`, which is `1` 
by default and typically left at `1`.
+
+You can use the options in `conf/spark-env.sh` to set the number of threads for 
Intel MKL or OpenBLAS:
+* For Intel MKL:
+```
+MKL_NUM_THREADS=1
+```
+* For OpenBLAS:
+```
+OPENBLAS_NUM_THREADS=1
+```
+
+[^1]: Please refer to the following resources to understand how to configure 
the number of threads for these BLAS implementations: [Intel 
MKL](https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications)
 or [Intel 
oneMKL](https://software.intel.com/en-us/onemkl-linux-developer-guide-improving-performance-with-threading)
 and [OpenBLAS](https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded).
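
As a minimal sketch of the "Spark Configuration" step in the new guide, the fragment below shows one way the thread settings could look in `conf/spark-env.sh`. `TASK_CPUS` is a hypothetical helper variable, not a Spark setting; it is assumed to be kept in sync by hand with `spark.task.cpus`. The `export` keyword is an addition here, so that the values reach child processes even if the file is sourced outside Spark's launch scripts.

```shell
# Sketch for conf/spark-env.sh: pin native BLAS threading to the per-task CPU count.
# TASK_CPUS is a hypothetical helper, assumed to mirror spark.task.cpus (default 1).
TASK_CPUS=1

# Thread limit read by Intel MKL.
export MKL_NUM_THREADS="$TASK_CPUS"
# Thread limit read by OpenBLAS.
export OPENBLAS_NUM_THREADS="$TASK_CPUS"
```

This assumes every node that runs executors reads the same file; other deployments may need a different mechanism to propagate the variables.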


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
