[SYSTEMML-1170] Clean Up Python Documentation For Next Release

Cleanup of Python documentation.
Closes #335.

Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/94cf7c15
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/94cf7c15
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/94cf7c15

Branch: refs/heads/gh-pages
Commit: 94cf7c15b161a729f50ffec84e761b343e3ab2f9
Parents: 8268255
Author: Mike Dusenberry <[email protected]>
Authored: Mon Jan 9 14:02:08 2017 -0800
Committer: Mike Dusenberry <[email protected]>
Committed: Mon Jan 9 14:02:08 2017 -0800

----------------------------------------------------------------------
 README.md                            |   3 +-
 beginners-guide-python.md            | 128 ++++++++++++++++++------
 index.md                             |  13 +--
 spark-mlcontext-programming-guide.md |  66 +++++++--------
 4 files changed, 111 insertions(+), 99 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/94cf7c15/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index 6906c8d..5a4b175 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,7 @@ Jekyll (and optionally Pygments) can be installed on the Mac OS in the following
     $ brew install ruby
     $ gem install jekyll
     $ gem install jekyll-redirect-from
+    $ gem install bundler
     $ brew install python
     $ pip install Pygments
     $ gem install pygments.rb
@@ -38,4 +39,4 @@ documentation. From there, you can have Jekyll convert the markdown files to HTM
 Jekyll will serve up the generated documentation by default at http://127.0.0.1:4000. Modifications to *.md files will be converted to HTML and can be viewed in a web browser.
 
-    $ jekyll serve -w
\ No newline at end of file
+    $ jekyll serve -w

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/94cf7c15/beginners-guide-python.md
----------------------------------------------------------------------
diff --git a/beginners-guide-python.md b/beginners-guide-python.md
index c919f3f..8bd957a 100644
--- a/beginners-guide-python.md
+++ b/beginners-guide-python.md
@@ -54,7 +54,8 @@ If you already have an Apache Spark installation, you can skip this step.
 /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
 brew tap caskroom/cask
 brew install Caskroom/cask/java
-brew install apache-spark
+brew tap homebrew/versions
+brew install apache-spark16
 ```
 </div>
 <div data-lang="Linux" markdown="1">
@@ -70,37 +71,60 @@ brew install apache-spark16
 
 ### Install SystemML
 
-We are working towards uploading the python package on pypi. Until then, please use following commands:
+We are working towards uploading the python package on PyPI. Until then, please use the following
+commands:
 
+<div class="codetabs">
+<div data-lang="Python 2" markdown="1">
 ```bash
 git clone https://github.com/apache/incubator-systemml.git
 cd incubator-systemml
 mvn clean package -P distribution
 pip install target/systemml-0.12.0-incubating-SNAPSHOT-python.tgz
 ```
-
-The above commands will install Python package and place the corresponding Java binaries (along with algorithms) into the installed location.
-To find the location of the downloaded Java binaries, use the following command:
-
+</div>
+<div data-lang="Python 3" markdown="1">
 ```bash
-python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'
+git clone https://github.com/apache/incubator-systemml.git
+cd incubator-systemml
+mvn clean package -P distribution
+pip3 install target/systemml-0.12.0-incubating-SNAPSHOT-python.tgz
 ```
+</div>
+</div>
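As a quick check that the install above worked, the package can be exercised from a `pyspark` shell (a minimal sketch using only calls that appear later in this guide; the expected output assumes the 3x3 all-ones input):

```python
import numpy as np
import systemml as sml

# build a small SystemML matrix and bring the row sums back as a NumPy array
m = sml.matrix(np.ones((3, 3)))
print(m.sum(axis=1).toNumPy())  # expected: three rows, each summing to 3.0
```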
 
-Note: the user is free to either use the prepackaged Java binaries
-or download them from [SystemML website](http://systemml.apache.org/download.html)
-or build them from the [source](https://github.com/apache/incubator-systemml).
-
+### Uninstall SystemML
 
 To uninstall SystemML, please use the following command:
 
+<div class="codetabs">
+<div data-lang="Python 2" markdown="1">
 ```bash
-pip uninstall systemml-incubating
+pip uninstall systemml
 ```
+</div>
+<div data-lang="Python 3" markdown="1">
+```bash
+pip3 uninstall systemml
+```
+</div>
+</div>
 
 ### Start PySpark shell
 
+<div class="codetabs">
+<div data-lang="Python 2" markdown="1">
 ```bash
-pyspark --master local[*]
+pyspark
 ```
+</div>
+<div data-lang="Python 3" markdown="1">
+```bash
+PYSPARK_PYTHON=python3 pyspark
+```
+</div>
+</div>
+
+---
 
 ## Matrix operations
 
@@ -118,20 +142,20 @@ m4.sum(axis=1).toNumPy()
 
 Output:
 
-```bash
+```python
 array([[-60.],
        [-60.],
        [-60.]])
 ```
 
 Let us now write a simple script to train a [linear regression](https://apache.github.io/incubator-systemml/algorithms-regression.html#linear-regression)
-model: $ \beta = solve(X^T X, X^T y) $. For simplicity, we will use direct-solve method and ignore regularization parameter as well as intercept.
+model: $ \beta = solve(X^T X, X^T y) $. For simplicity, we will use the direct-solve method and
+ignore the regularization parameter as well as the intercept.
 
 ```python
 import numpy as np
 from sklearn import datasets
 import systemml as sml
-from pyspark.sql import SQLContext
 
 # Load the diabetes dataset
 diabetes = datasets.load_diabetes()
 
 # Use only one feature
@@ -158,7 +182,10 @@ Output:
 
 Residual sum of squares: 25282.12
 ```
 
-We can improve the residual error by adding an intercept and regularization parameter. To do so, we will use `mllearn` API described in the next section.
+We can improve the residual error by adding an intercept and regularization parameter. To do so, we
+will use the `mllearn` API described in the next section.
+
+---
 
 ## Invoke SystemML's algorithms
 
@@ -206,7 +233,7 @@ algorithm on digits datasets.
 ```python
 # Scikit-learn way
-from sklearn import datasets, neighbors
+from sklearn import datasets
 from systemml.mllearn import LogisticRegression
 from pyspark.sql import SQLContext
 sqlCtx = SQLContext(sc)
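The hunks above and below belong to a single example that trains `mllearn`'s `LogisticRegression` on the scikit-learn digits data in the familiar scikit-learn style; condensed, the full example reads roughly as follows (a sketch; assumes a `pyspark` shell, so `sc` is already defined):

```python
from sklearn import datasets
from systemml.mllearn import LogisticRegression
from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)  # sc is provided by the pyspark shell
digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target
n_samples = len(X_digits)

# train on the first 90% of the digits, then score on the held-out 10%
logistic = LogisticRegression(sqlCtx)
logistic.fit(X_digits[:int(.9 * n_samples)], y_digits[:int(.9 * n_samples)])
print('LogisticRegression score: %f'
      % logistic.score(X_digits[int(.9 * n_samples):], y_digits[int(.9 * n_samples):]))
```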
@@ -233,7 +260,7 @@ LogisticRegression score: 0.922222
 To train the above algorithm on a larger dataset, we can load the dataset into a DataFrame and pass it to the `fit` method:
 
 ```python
-from sklearn import datasets, neighbors
+from sklearn import datasets
 from systemml.mllearn import LogisticRegression
 from pyspark.sql import SQLContext
 import pandas as pd
@@ -245,7 +272,7 @@ X_digits = digits.data
 y_digits = digits.target
 n_samples = len(X_digits)
 # Split the data into training/testing sets and convert to PySpark DataFrame
-df_train = sml.convertToLabeledDF(sqlContext, X_digits[:int(.9 * n_samples)], y_digits[:int(.9 * n_samples)])
+df_train = sml.convertToLabeledDF(sqlCtx, X_digits[:int(.9 * n_samples)], y_digits[:int(.9 * n_samples)])
 X_test = sqlCtx.createDataFrame(pd.DataFrame(X_digits[int(.9 * n_samples):]))
 logistic = LogisticRegression(sqlCtx)
 logistic.fit(df_train)
@@ -274,18 +301,18 @@ from pyspark.ml.feature import HashingTF, Tokenizer
 from pyspark.sql import SQLContext
 sqlCtx = SQLContext(sc)
 training = sqlCtx.createDataFrame([
-    (0L, "a b c d e spark", 1.0),
-    (1L, "b d", 2.0),
-    (2L, "spark f g h", 1.0),
-    (3L, "hadoop mapreduce", 2.0),
-    (4L, "b spark who", 1.0),
-    (5L, "g d a y", 2.0),
-    (6L, "spark fly", 1.0),
-    (7L, "was mapreduce", 2.0),
-    (8L, "e spark program", 1.0),
-    (9L, "a e c l", 2.0),
-    (10L, "spark compile", 1.0),
-    (11L, "hadoop software", 2.0)
+    (0, "a b c d e spark", 1.0),
+    (1, "b d", 2.0),
+    (2, "spark f g h", 1.0),
+    (3, "hadoop mapreduce", 2.0),
+    (4, "b spark who", 1.0),
+    (5, "g d a y", 2.0),
+    (6, "spark fly", 1.0),
+    (7, "was mapreduce", 2.0),
+    (8, "e spark program", 1.0),
+    (9, "a e c l", 2.0),
+    (10, "spark compile", 1.0),
+    (11, "hadoop software", 2.0)
 ], ["id", "text", "label"])
 tokenizer = Tokenizer(inputCol="text", outputCol="words")
 hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20)
@@ -293,10 +320,10 @@ lr = LogisticRegression(sqlCtx)
 pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
 model = pipeline.fit(training)
 test = sqlCtx.createDataFrame([
-    (12L, "spark i j k"),
-    (13L, "l m n"),
-    (14L, "mapreduce spark"),
-    (15L, "apache hadoop")], ["id", "text"])
+    (12, "spark i j k"),
+    (13, "l m n"),
+    (14, "mapreduce spark"),
+    (15, "apache hadoop")], ["id", "text"])
 prediction = model.transform(test)
 prediction.show()
 ```
@@ -304,27 +331,28 @@ prediction.show()
 
 Output:
 
 ```bash
-+--+---------------+--------------------+--------------------+--------------------+---+----------+
-|id|           text|               words|            features|         probability| ID|prediction|
-+--+---------------+--------------------+--------------------+--------------------+---+----------+
-|12|    spark i j k|ArrayBuffer(spark...|(20,[5,6,7],[2.0,...|[0.99999999999975...|1.0|       1.0|
-|13|          l m n|ArrayBuffer(l, m, n)|(20,[8,9,10],[1.0...|[1.37552128844736...|2.0|       2.0|
-|14|mapreduce spark|ArrayBuffer(mapre...|(20,[5,10],[1.0,1...|[0.99860290938153...|3.0|       1.0|
-|15|  apache hadoop|ArrayBuffer(apach...|(20,[9,14],[1.0,1...|[5.41688748236143...|4.0|       2.0|
-+--+---------------+--------------------+--------------------+--------------------+---+----------+
++-------+---+---------------+------------------+--------------------+--------------------+----------+
+|__INDEX| id|           text|             words|            features|         probability|prediction|
++-------+---+---------------+------------------+--------------------+--------------------+----------+
+|    1.0| 12|    spark i j k|  [spark, i, j, k]|(20,[5,6,7],[2.0,...|[0.99999999999975...|       1.0|
+|    2.0| 13|          l m n|         [l, m, n]|(20,[8,9,10],[1.0...|[1.37552128844736...|       2.0|
+|    3.0| 14|mapreduce spark|[mapreduce, spark]|(20,[5,10],[1.0,1...|[0.99860290938153...|       1.0|
+|    4.0| 15|  apache hadoop|  [apache, hadoop]|(20,[9,14],[1.0,1...|[5.41688748236143...|       2.0|
++-------+---+---------------+------------------+--------------------+--------------------+----------+
 ```
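Since `model.transform(test)` returns an ordinary PySpark DataFrame, the usual DataFrame operations apply to the output shown above; for instance, to read off just the predicted labels (a small usage sketch):

```python
# narrow the transformed DataFrame to the columns of interest
prediction.select("id", "text", "prediction").show(truncate=False)
```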
+
+---
 
 ## Invoking DML/PyDML scripts using MLContext
 
 The below example demonstrates how to invoke the algorithm [scripts/algorithms/MultiLogReg.dml](https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/MultiLogReg.dml)
 using the Python [MLContext API](https://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide).
 
 ```python
-from sklearn import datasets, neighbors
-from pyspark.sql import DataFrame, SQLContext
+from sklearn import datasets
+from pyspark.sql import SQLContext
 import systemml as sml
 import pandas as pd
-import os, imp
 sqlCtx = SQLContext(sc)
 digits = datasets.load_digits()
 X_digits = digits.data
@@ -334,8 +362,8 @@ n_samples = len(X_digits)
 X_df = sqlCtx.createDataFrame(pd.DataFrame(X_digits[:int(.9 * n_samples)]))
 y_df = sqlCtx.createDataFrame(pd.DataFrame(y_digits[:int(.9 * n_samples)]))
 ml = sml.MLContext(sc)
-# Get the path of MultiLogReg.dml
-scriptPath = os.path.join(imp.find_module("systemml")[1], 'systemml-java', 'scripts', 'algorithms', 'MultiLogReg.dml')
-script = sml.dml(scriptPath).input(X=X_df, Y_vec=y_df).output("B_out")
+# Run the MultiLogReg.dml script at the given URL
+scriptUrl = "https://raw.githubusercontent.com/apache/incubator-systemml/master/scripts/algorithms/MultiLogReg.dml"
+script = sml.dml(scriptUrl).input(X=X_df, Y_vec=y_df).output("B_out")
 beta = ml.execute(script).get('B_out').toNumPy()
 ```
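At this point `beta` is a plain NumPy array — `.toNumPy()` is the bridge from SystemML matrices back to local Python — so the learned coefficients can be inspected directly (a minimal usage sketch; the exact shape depends on MultiLogReg's options, such as the intercept setting):

```python
# the fetched model is local NumPy data
print(beta.shape)  # rows: features (plus intercept row if enabled); typically one column per non-baseline class
print(beta[:5])    # peek at the first few coefficient rows
```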
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/94cf7c15/index.md
----------------------------------------------------------------------
diff --git a/index.md b/index.md
index 6b91654..fe8361a 100644
--- a/index.md
+++ b/index.md
@@ -42,13 +42,11 @@ To download SystemML, visit the [downloads](http://systemml.apache.org/download)
 
 ## Running SystemML
 
+* **[Beginner's Guide For Python Users](beginners-guide-python)** - Beginner's Guide for Python users.
 * **[Spark MLContext](spark-mlcontext-programming-guide)** - Spark MLContext is a programmatic API for running SystemML from Spark via Scala, Python, or Java.
-  * See the [Spark MLContext Programming Guide](spark-mlcontext-programming-guide) with the
-  following examples:
-    * [**Spark Shell (Scala)**](spark-mlcontext-programming-guide#spark-shell-example---new-api)
-    * [**Zeppelin Notebook (Scala)**](spark-mlcontext-programming-guide#zeppelin-notebook-example---linear-regression-algorithm---old-api)
-    * [**Jupyter Notebook (PySpark)**](spark-mlcontext-programming-guide#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization---old-api)
+  * [**Spark Shell Example (Scala)**](spark-mlcontext-programming-guide#spark-shell-example)
+  * [**Jupyter Notebook Example (PySpark)**](spark-mlcontext-programming-guide#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization)
 * **[Spark Batch](spark-batch-mode)** - Algorithms are automatically optimized to run across Spark clusters.
   * See [Invoking SystemML in Spark Batch Mode](spark-batch-mode) for detailed information.
 * **[Hadoop Batch](hadoop-batch-mode)** - Algorithms are automatically optimized when distributed across Hadoop clusters.
@@ -62,16 +60,13 @@ machine in R-like and Python-like declarative languages.
 
 ## Language Guides
 
+* [Python API Reference](python-reference) - API Reference Guide for Python users.
 * [DML Language Reference](dml-language-reference) -
 DML is a high-level R-like declarative language for machine learning.
 * **PyDML Language Reference** **(Coming Soon)** -
 PyDML is a high-level Python-like declarative language for machine learning.
 * [Beginner's Guide to DML and PyDML](beginners-guide-to-dml-and-pydml) -
 An introduction to the basics of DML and PyDML.
-* [Beginner's Guide for Python users](beginners-guide-python) -
-Beginner's Guide for Python users.
-* [Reference Guide for Python users](python-reference) -
-Reference Guide for Python users.
 
 ## ML Algorithms

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/94cf7c15/spark-mlcontext-programming-guide.md
----------------------------------------------------------------------
diff --git a/spark-mlcontext-programming-guide.md b/spark-mlcontext-programming-guide.md
index fbc8f5b..dcaa125 100644
--- a/spark-mlcontext-programming-guide.md
+++ b/spark-mlcontext-programming-guide.md
@@ -35,14 +35,10 @@ such as Scala, Java, and Python. As a result, it offers a convenient way to inte
 Shell and from Notebooks such as Jupyter and Zeppelin.
 
 **NOTE: A new MLContext API has been redesigned for future SystemML releases. The old API is available
-in all versions of SystemML but will be deprecated and removed, so please migrate to the new API.**
+in previous versions of SystemML but is deprecated and will be removed soon, so please migrate to the new API.**
 
-# Spark Shell Example - NEW API
-
-**NOTE: The new MLContext API will be available in future SystemML releases. It can be used
-by building the project using Maven ('mvn clean package', or 'mvn clean package -P distribution').
-For SystemML version 0.10.0 and earlier, please see the documentation regarding the old API.**
+# Spark Shell Example
 
 ## Start Spark Shell with SystemML
 
@@ -1644,25 +1640,8 @@ scala> for (i <- 1 to 5) {
 
 # Jupyter (PySpark) Notebook Example - Poisson Nonnegative Matrix Factorization
 
-Similar to the Scala API, SystemML also provides a Python MLContext API. In addition to the
-regular `SystemML.jar` file, you'll need to install the Python API as follows:
-
- * Latest release:
-   * Python 2:
-
-     ```
-     pip install systemml
-     # Bleeding edge: pip install git+git://github.com/apache/incubator-systemml.git#subdirectory=src/main/python
-     ```
-
-   * Python 3:
-
-     ```
-     pip3 install systemml
-     # Bleeding edge: pip3 install git+git://github.com/apache/incubator-systemml.git#subdirectory=src/main/python
-     ```
- * Don't forget to download the `SystemML.jar` file, which can be found in the latest release, or
-   in a nightly build.
+Similar to the Scala API, SystemML also provides a Python MLContext API. Before usage, you'll need
+**[to install it first](beginners-guide-python#download--setup)**.
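Before running the full notebook, a short first cell can confirm that the Python MLContext API is wired up (a minimal sketch; assumes the PySpark kernel provides `sc`, and the inline DML snippet is illustrative):

```python
import systemml as sml

ml = sml.MLContext(sc)  # sc is provided by the PySpark kernel
script = sml.dml("x = sum(seq(1, 100))").output("x")
print(ml.execute(script).get("x"))  # expected: 5050.0
```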
 Here, we'll explore the use of SystemML via PySpark in a [Jupyter notebook](http://jupyter.org/).
 This Jupyter notebook example can be nicely viewed in a rendered state
@@ -1671,17 +1650,18 @@ and can be [downloaded here](https://raw.githubusercontent.com/apache/incubator-
 
 From the directory with the downloaded notebook, start Jupyter with PySpark:
 
- * Python 2:
-
-   ```
-   PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path SystemML.jar --jars SystemML.jar
-   ```
-
- * Python 3:
-
-   ```
-   PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path SystemML.jar --jars SystemML.jar
-   ```
+<div class="codetabs">
+<div data-lang="Python 2" markdown="1">
+```bash
+PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
+```
+</div>
+<div data-lang="Python 3" markdown="1">
+```bash
+PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
+```
+</div>
+</div>
 
 This will open Jupyter in a browser:
 
@@ -1797,6 +1777,9 @@ plt.title('PNMF Training Loss')
 
 # Spark Shell Example - OLD API
 
+### **NOTE: This API is old and has been deprecated.**
+**Please use the [new MLContext API](spark-mlcontext-programming-guide#spark-shell-example) instead.**
+
 ## Start Spark Shell with SystemML
 
 To use SystemML with the Spark Shell, the SystemML jar can be referenced using the Spark Shell's `--jars` option.
 
@@ -2216,11 +2199,13 @@ val (min, max, mean) = minMaxMean(sysMlMatrix, numRows, numCols, ml)
 
 </div>
 
-
-* * *
+---
 
 # Zeppelin Notebook Example - Linear Regression Algorithm - OLD API
 
+### **NOTE: This API is old and has been deprecated.**
+**Please use the [new MLContext API](spark-mlcontext-programming-guide#spark-shell-example) instead.**
+
 Next, we'll consider an example of a SystemML linear regression algorithm run from Spark through
 an Apache Zeppelin notebook. Instructions to clone and build Zeppelin can be found at the
 [GitHub Apache Zeppelin](https://github.com/apache/incubator-zeppelin) site. This example also will look at
 the Spark ML linear regression algorithm.
 
@@ -2701,10 +2686,13 @@ Training time per iter: 0.2334166666666667 seconds
 {% endhighlight %}
 
-* * *
+---
 
 # Jupyter (PySpark) Notebook Example - Poisson Nonnegative Matrix Factorization - OLD API
 
+### **NOTE: This API is old and has been deprecated.**
+**Please use the [new MLContext API](spark-mlcontext-programming-guide#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization) instead.**
+
 Here, we'll explore the use of SystemML via PySpark in a [Jupyter notebook](http://jupyter.org/).
 This Jupyter notebook example can be nicely viewed in a rendered state
 [on GitHub](https://github.com/apache/incubator-systemml/blob/master/samples/jupyter-notebooks/SystemML-PySpark-Recommendation-Demo.ipynb),