[SYSTEMML-849][SYSTEMML-457][SYSTEMML-458] Clean Up and Reorganize Documentation Targeted At Data Scientists
This update cleans up and reorganizes a lot of the existing documentation. The goal here is to work towards cleaning up our external message and targeting specific types of users in order to increase ease of adoption. My vision is that we target data scientists using Spark first and foremost. Without this focus in our documentation, the project is seemingly too confusing, and will deter this key user demographic from adoption. Once these users are onboard, engine developers will follow. This PR is a first effort towards this goal, and provides a nicer, cleaned-up version of the docs. We should collectively work to improve these, with clear separation between data scientists, systems engineers/researchers, etc. Closes #203. Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/5929b3b7 Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/5929b3b7 Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/5929b3b7 Branch: refs/heads/gh-pages Commit: 5929b3b7333e761dd6e1b440fa17a6e3b538c5cc Parents: c2b5c7e Author: Mike Dusenberry <[email protected]> Authored: Fri Aug 5 15:31:58 2016 -0700 Committer: Mike Dusenberry <[email protected]> Committed: Fri Aug 5 15:31:58 2016 -0700 ---------------------------------------------------------------------- Gemfile | 3 - _layouts/global.html | 2 +- developer-tools-systemml.md | 28 +- dml-language-reference.md | 178 +-------- engine-dev-guide.md | 56 +++ hadoop-batch-mode.md | 71 ++-- index.md | 53 ++- quick-start-guide.md | 399 -------------------- release-process.md | 42 +-- spark-batch-mode.md | 84 +++++ spark-mlcontext-programming-guide.md | 59 +-- standalone-guide.md | 582 ++++++++++++++++++++++++++++++ 12 files changed, 863 insertions(+), 694 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5929b3b7/Gemfile ---------------------------------------------------------------------- diff --git a/Gemfile b/Gemfile deleted file mode 100644 index b5d5550..0000000 --- a/Gemfile +++ /dev/null @@ -1,3 +0,0 @@ -source "https://rubygems.org" -gem "jekyll" -gem "rouge" \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5929b3b7/_layouts/global.html ---------------------------------------------------------------------- diff --git a/_layouts/global.html b/_layouts/global.html index f2c74bd..3527421 100644 --- a/_layouts/global.html +++ b/_layouts/global.html @@ -44,7 +44,7 @@ <ul class="dropdown-menu" role="menu"> <li><b>Running SystemML:</b></li> <li><a href="https://github.com/apache/incubator-systemml">SystemML GitHub README</a></li> - <li><a href="quick-start-guide.html">Quick Start Guide</a></li> + <li><a href="standalone-guide.html">Quick Start Guide</a></li> <li><a href="spark-mlcontext-programming-guide.html">Spark MLContext Programming Guide</a></li> <li><a href="hadoop-batch-mode.html">Hadoop Batch Mode</a> <li><a href="jmlc.html">Java Machine Learning Connector (JMLC)</a> http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5929b3b7/developer-tools-systemml.md ---------------------------------------------------------------------- diff --git a/developer-tools-systemml.md b/developer-tools-systemml.md index d874393..44069cb 100644 --- a/developer-tools-systemml.md +++ b/developer-tools-systemml.md @@ -28,6 +28,15 @@ Useful Tools for Developing SystemML: * This will become a table of contents (this 
text will be scraped). {:toc} +## IntelliJ + +IntelliJ can be used since it provides great support for mixed Java and Scala projects as described [here](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ). + +### Import SystemML project to IntelliJ + + 1. Download IntelliJ and install the Scala plug-in for IntelliJ. + 2. Go to "File -> Import Project", locate the SystemML source directory, and select "Maven Project". + 3. In the Import wizard, it's fine to leave settings at their default. However, it is usually useful to enable "Import Maven projects automatically", since changes to the project structure will automatically update the IntelliJ project. ## Eclipse @@ -40,7 +49,7 @@ To get started in Eclipse, import SystemML's pom.xml file as an existing Maven p -## Eclipse with Scala +### Eclipse with Scala An additional Maven connector is required for working with Scala code in Eclipse. The [Maven Integration for Scala IDE](http://scala-ide.org/docs/tutorials/m2eclipse/) plugin can be installed into Eclipse from [this](http://alchim31.free.fr/m2e-scala/update-site/) update site. @@ -59,7 +68,7 @@ Note the corresponding Eclipse project needs to include the Scala nature. Typic -## Eclipse Java Only (How to skip Scala) +### Eclipse Java Only (How to skip Scala) Since the core SystemML code is written in Java, developers may prefer not to use Eclipse in a mixed Java/Scala environment. To configure Eclipse to skip the Scala code of SystemML and avoid installing any Scala-related components, Maven lifecycle mappings can be created. The simplest way to create these mappings is to use Eclipse's quick fix option to resolve the following pom.xml errors which occur if Maven Integration for Scala is not present. @@ -69,7 +78,18 @@ Since the core SystemML code is written in Java, developers may prefer not to us The lifecycle mappings are stored in a workspace metadata file as specified in Eclipse's Maven Lifecycle Mappings Preferences page. The pom.xml file itself is unchanged, which allows the Scala portion to be built outside of Eclipse using the mvn command line (see the command-line sketch after the troubleshooting notes below). +## Troubleshooting -## IntelliJ +The following notes describe how to resolve compilation issues that might occur after importing the SystemML project: + +##### `invalid cross-compiled libraries` error +Since Scala IDE bundles the latest versions (2.10.5 and 2.11.6 at this point), you need to add one in Eclipse Preferences -> Scala -> Installations by pointing to the lib/ directory of your Scala 2.10.4 distribution. Once this is done, select all SystemML projects, right-click, choose Scala -> Set Scala Installation, and point to the 2.10.4 installation. This should clear all errors about invalid cross-compiled libraries. A clean build should succeed now. + +##### `incompatible scala version` error +Change the IDE Scala version via `project -> properties -> scala compiler -> scala installation` to `Fixed Scala Installation: 2.10.5`. + +##### `Not found type *` error +Run the command `mvn package`, and then do `project -> refresh`. -IntelliJ can also be used since it provides good support for mixed Java and Scala projects as described [here](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ). +##### `marketplace not found` error for Eclipse Luna +In addition to installing the Scala IDE plugin, make sure to install the Maven connector for Scala from the update site "http://alchim31.free.fr/m2e-scala/update-site".
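To make the out-of-IDE build mentioned above concrete, here is a minimal command-line sketch; it assumes Maven is on the `PATH`, and the `-DskipTests` flag is a standard Maven option used here only as an optional shortcut, not something the SystemML docs mandate:

    # Build SystemML (including the Scala sources) from the command line, outside of Eclipse/IntelliJ
    mvn clean package

    # Optionally skip the test suite for a faster turnaround (standard Maven flag)
    mvn clean package -DskipTests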
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5929b3b7/dml-language-reference.md ---------------------------------------------------------------------- diff --git a/dml-language-reference.md b/dml-language-reference.md index ee16085..82a7b24 100644 --- a/dml-language-reference.md +++ b/dml-language-reference.md @@ -61,15 +61,16 @@ limitations under the License. ## Introduction -SystemML compiles scripts written in Declarative Machine Learning (or DML for short) into MapReduce jobs. DML's syntax closely follows R, thereby minimizing the learning curve to use SystemML. Before getting into detail, let's start with a simple Hello World program in DML. Assuming that Hadoop is installed on your machine or cluster, place SystemML.jar and SystemML-config.xml into your directory. Now, create a text file "hello.dml" containing following code: +SystemML compiles scripts written in Declarative Machine Learning (or DML for short) into mixed driver and distributed jobs. DML's syntax closely follows R, thereby minimizing the learning curve to use SystemML. Before getting into detail, let's start with a simple Hello World program in DML. Assuming that Spark is installed on your machine or cluster, place `SystemML.jar` into your directory. Now, create a text file `hello.dml` containing the following code: print("Hello World"); To run this program on your machine, use the following command: - hadoop jar SystemML.jar -f hello.dml + spark-submit SystemML.jar -f hello.dml -The option `-f` in the above command refers to the path to the DML script. The detailed list of the options is given in the section "Invocation of SystemML". +The option `-f` in the above command refers to the path to the DML script. A detailed list of the +available options can be found by running `spark-submit SystemML.jar -help`. ## Variables @@ -547,13 +548,13 @@ In above script, `ifdef(\$nbrRows, 10)` function is a short-hand for "`ifdef(\$n Let's assume that the above script is invoked using the following command-line values: - hadoop jar SystemML.jar -f test.dml -nvargs fname=test.mtx nbrRows=5 nbrCols=5 + spark-submit SystemML.jar -f test.dml -nvargs fname=test.mtx nbrRows=5 nbrCols=5 In this case, the script will create a random matrix M with 5 rows and 5 columns and write it to the file "test.mtx" in csv format. After that it will print the message "Done creating and writing random matrix in test.mtx" on the standard output. If however, the above script is invoked from the command line using named arguments: - hadoop jar SystemML.jar -f test.dml -nvargs fname=test.mtx nbrCols=5 + spark-submit SystemML.jar -f test.dml -nvargs fname=test.mtx nbrCols=5 Then, the script will instead create a random matrix M with 10 rows (i.e. the default value provided in the script) and 5 columns. @@ -1539,170 +1540,3 @@ All reserved keywords are case-sensitive. String TRUE while - - -## Invocation of SystemML - -To execute a DML script, SystemML is invoked as follows: - - hadoop jar SystemML.jar [-? | -help | -f] <filename> (-config=<config_filename>)? (-args | -nvargs)? <args-list>? - -Where - -`-f <filename>: will be interpreted as a path to file with DML script. <filename> prefixed with hdfs or gpfs is assumed path in DFS, otherwise <filename> treated as path on local file system` - -`-debug: (optional) run in debug mode` - -`-config=<config_filename>: (optional) use config file located at specified path <config_filename>.
<config_filename> prefixed with hdfs or gpfs is assumed path in DFS, otherwise <config_filename> treated as path on local file system (default value for <config_filename> is ./SystemML-config.xml)` - -`-args <args-list>: (optional) parameterize DML script with contents of <args-list>, which is ALL args after -args flag. Each argument must be an unnamed-argument, where 1st value after -args will replace \$1 in DML script, 2nd value will replace \$2 in DML script, and so on.` - -`-nvargs <args-list>: (optional) parameterize DML script with contents of <args-list>, which is ALL args after -nvargs flag. Each argument must be named-argument of form name=value, where value will replace \$name in DML script.` - -`-?, or -help: show this help.` - -NOTE: Please refer to section on Command-line Arguments for more details and restrictions on usage of command-line arguments to DML script using `âargs <args-list> and ânvargs <args-list>`. - - -### Examples - -Run a script in local file foo.dml: - - hadoop jar SystemML.jar -f foo.dml - -An example debug session: - -First, you need to call SystemML using âdebug flag. - - hadoop jar SystemML.jar -f test.dml âdebug - -You can see the line numbers in your DML script by "list" (or simply "l") command: - - (SystemMLdb) l - line   1: A = matrix("1 2 3 4 5 6", rows=3, cols=2) - line   2: - line   3: B = cumsum(A) - line   4: #print(B) - line   5: print(sum(B)) - -The next step is usually to set a breakpoint where we need to analyze the state of our variables: - - (SystemMLdb) b 5 - -Breakpoint added at .defaultNS::main, line 5. - -Now, that we have set a breakpoint, we can start running our DML script: - - (SystemMLdb) r - Breakpoint reached at .defaultNS::main instID 15: (line 5). - (SystemMLdb) p B - 1.0000  2.0000 - 4.0000 6.0000 - 9.0000 12.0000 - - -## MLContext API ----- - -The MLContext API allows users to pass RDDs as input/output to SystemML through Java, Scala, or Python. - -Typical usage for MLContext using Spark's Scala Shell is as follows: - - scala> import org.apache.sysml.api.MLContext - -Create input DataFrame from CSV file and potentially perform some feature transformation - - scala> val W = sqlContext.load(...) - scala> val H = sc.textFile("V.csv") - scala> val V = sc.textFile("V.text") - -Create MLContext - - scala> val ml = new MLContext(sc) - -Register input and output DataFrame/RDD - -Supported formats are: - - 1. DataFrame - 2. CSV/Text (as JavaRDD<String> or JavaPairRDD<LongWritable, Text>) - 3. Binary blocked RDD (JavaPairRDD<MatrixIndexes,MatrixBlock>)) - -Also overloaded to support metadata information such as format, rlen, -clen, etc. - -Please note the variable names given below in quotes correspond to the -variables in DML script. - -These variables need to have corresponding read/write associated in DML -script. - -Currently, only matrix variables are supported through -registerInput/registerOutput interface. - -To pass scalar variables, use named/positional arguments (described -later) or wrap them into matrix variable. - - scala> ml.registerInput("V", V) - scala> ml.registerInput("W", W, "csv") - scala> ml.registerInput("H", H, "text", 50, 1500) - scala> ml.registerOutput("H") - scala> ml.registerOutput("W") - -As DataFrame is internally converted to CSV format, one can skip -providing dimensions. 
- -Call script with default arguments: - - scala> val outputs = ml.execute("GNMF.dml") - -MLContext also supports calling script with positional arguments (args) -and named arguments (nargs): - - scala> val args = Array("V.mtx", "W.mtx", "H.mtx", "2000", "1500", "50", "1", "WOut.mtx", "HOut.mtx") - scala> val nargs = Map("maxIter"->"1") - scala> val outputs = ml.execute("GNMF.dml", args) # or ml.execute("GNMF.dml", nargs) - -We can then fetch the output RDDs in SystemMLâs binary blocked format or -as DataFrame. - - scala> val HOut = outputs.getDF(sqlContext, "H") - scala> val WOut = outputs. getBinaryBlockedRDD(sqlContext, "W") - -To register new input/outputs and to re-execute the script, it is -recommended that you first reset MLContext - - scala> ml.reset() - scala> ml.registerInput("V", newV) - -Though it is possible to re-run the script using different (or even same -arguments), but using same registered input/outputs without reset, it is -discouraged. This is because the symbol table entries would have been -updated since last invocation: - - scala> val new_outputs = ml.execute("GNMF.dml", new_args) - -The Python MLContext API is similar to Scala/Java MLContext API. Here is -an example: - - >>> from pyspark.sql import SQLContext - >>> from SystemML import MLContext - >>> sqlContext = SQLContext(sc) - >>> H = sqlContext.jsonFile("H.json") - >>> V = sqlContext.jsonFile("V.json") - >>> W = sqlContext.jsonFile("W.json") - >>> ml = MLContext(sc) - >>> ml.registerInput("V", V) - >>> ml.registerInput("W", W) - >>> ml.registerInput("H", H) - >>> ml.registerOutput("H") - >>> ml.registerOutput("W") - >>> outputs = ml.execute("GNMF.dml") - -Note: - -- The current version does not allow users to create multiple - MLContexts and only allows one thread to execute DML script using - the created MLContext. -- Even though the above example shows the usage through Scala/Python - Shell, it works for Spark-Submit and PySpark-Submit as well. http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5929b3b7/engine-dev-guide.md ---------------------------------------------------------------------- diff --git a/engine-dev-guide.md b/engine-dev-guide.md new file mode 100644 index 0000000..634846a --- /dev/null +++ b/engine-dev-guide.md @@ -0,0 +1,56 @@ +--- +layout: global +displayTitle: SystemML Engine Developer Guide +title: SystemML Engine Developer Guide +description: SystemML Engine Developer Guide +--- +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> +* This will become a table of contents (this text will be scraped). +{:toc} + +## Building SystemML + +SystemML is built using [Apache Maven](http://maven.apache.org/). +SystemML will build on Linux, MacOS, or Windows, and requires Maven 3 and Java 7 (or higher). 
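Before building, a quick sanity check of the toolchain can help; this is a minimal sketch assuming `mvn` and `java` are already on the `PATH`:

    # Confirm that Maven 3 and Java 7 (or higher) are available
    mvn -version
    java -version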
+To build SystemML, run: + + mvn clean package + +To build the SystemML distributions (`.tar.gz`, `.zip`, etc.), run: + + mvn clean package -P distribution + + +* * * + +## Testing SystemML + +SystemML features a comprehensive set of integration tests. To perform these tests, run: + + mvn verify + +Note: these tests require [R](https://www.r-project.org/) to be installed and available as part of the PATH variable on +the machine on which you are running these tests. + +If required, please install the following packages in R: + + install.packages(c("batch", "bitops", "boot", "caTools", "data.table", "doMC", "doSNOW", "ggplot2", "glmnet", "lda", "Matrix", "matrixStats", "moments", "plotrix", "psych", "reshape", "topicmodels", "wordcloud"), dependencies=TRUE) + +* * * http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5929b3b7/hadoop-batch-mode.md ---------------------------------------------------------------------- diff --git a/hadoop-batch-mode.md b/hadoop-batch-mode.md index 96b32a8..ddc1c1f 100644 --- a/hadoop-batch-mode.md +++ b/hadoop-batch-mode.md @@ -47,9 +47,7 @@ refer to the Hadoop documentation. # Hadoop Batch Mode Invocation Syntax -As described in the [Invocation of SystemML](dml-language-reference.html#invocation-of-systemml) section -of the [DML Language Reference](dml-language-reference.html), SystemML can be invoked in Hadoop Batch mode using -the following syntax: +SystemML can be invoked in Hadoop Batch mode using the following syntax: hadoop jar SystemML.jar [-? | -help | -f <filename>] (-config=<config_filename>) ([-args | -nvargs] <args-list>) @@ -110,14 +108,14 @@ that I unpacked. I updated the `PATH` variable to include the `JAVA_HOME` `bin` and the `HADOOP_HOME` `sbin` directory. [hadoop@host1 ~]# vi .bash_profile - + ... export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64 export HADOOP_HOME=/home/hadoop/hadoop-2.6.2 PATH=$JAVA_HOME/bin:$PATH:$HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin export PATH ... - + [hadoop@host1 ~]$ source ~/.bash_profile To verify that Java and Hadoop were on the path, I used the `java -version` and `hadoop version` commands. @@ -167,7 +165,7 @@ arguments to the DML script were specified following the `-nvargs` option. 15/11/11 15:56:22 INFO api.DMLScript: SystemML Statistics: Total execution time: 0.288 sec. Number of executed MR Jobs: 0. - + 15/11/11 15:56:22 INFO api.DMLScript: END DML run 11/11/2015 15:56:22 In the console output, we see a warning that no default SystemML config file was found in the current working directory. @@ -207,7 +205,7 @@ To clean things up, I'll delete the files that were generated. Next, we'll look at running SystemML with Hadoop in Pseudo-Distributed mode. In Pseudo-Distributed mode, each Hadoop daemon (such as NameNode and DataNode) runs in a separate Java process on a single machine. -In the previous section about Hadoop Standalone mode, we set up the `JAVA_HOME` and `HADOOP_HOME` environment variables +In the previous section about Hadoop Standalone mode, we set up the `JAVA_HOME` and `HADOOP_HOME` environment variables and added `JAVA_HOME/bin`, `HADOOP_HOME/bin`, and `HADOOP_HOME/sbin` to the `PATH` in `.bash_profile`. We also need to set the `JAVA_HOME` value in the `hadoop-env.sh` file in the Hadoop configuration directory (`etc/hadoop`). @@ -215,7 +213,7 @@ We also need to set the `JAVA_HOME` value in the `hadoop-env.sh` file in the Had [hadoop@host1 hadoop]$ pwd /home/hadoop/hadoop-2.6.2/etc/hadoop [hadoop@host1 hadoop]$ vi hadoop-env.sh - + ... 
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64 ... @@ -247,8 +245,8 @@ a password. In the Hadoop configuration directory (`etc/hadoop`), in the `core-site.xml` file, we specify the `fs.defaultFS` property to be `localhost` with port `9000`. - [hadoop@host1 hadoop]$ vi core-site.xml - + [hadoop@host1 hadoop]$ vi core-site.xml + ... <configuration> <property> @@ -261,8 +259,8 @@ property to be `localhost` with port `9000`. By default, HDFS replicates data on three nodes. Since we're running on a single machine, we'll change this to one. We'll add a `dfs.replication` property to `hdfs-site.xml` and set its value to `1`. - [hadoop@host1 hadoop]$ vi hdfs-site.xml - + [hadoop@host1 hadoop]$ vi hdfs-site.xml + ... <configuration> <property> @@ -275,7 +273,7 @@ We'll add a `dfs.replication` property to `hdfs-site.xml` and set its value to ` Next, we'll format HDFS. [hadoop@host1 ~]$ hdfs namenode -format - 15/11/11 17:23:33 INFO namenode.NameNode: STARTUP_MSG: + 15/11/11 17:23:33 INFO namenode.NameNode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = host1.example.com/9.30.252.15 @@ -342,7 +340,7 @@ Let's go ahead and execute the `genLinearRegressionData.dml` script in Hadoop Ps 15/11/11 18:16:35 INFO api.DMLScript: SystemML Statistics: Total execution time: 1.484 sec. Number of executed MR Jobs: 0. - + 15/11/11 18:16:35 INFO api.DMLScript: END DML run 11/11/2015 18:16:35 If we list the contents of the current directory in our regular file system, we see that no files have been written @@ -413,7 +411,7 @@ In the `yarn-site.xml` configuration file, we specify the `yarn.nodemanager.aux- to be `mapreduce_shuffle`. [hadoop@host1 hadoop]$ vi yarn-site.xml - + ... <configuration> <property> @@ -450,7 +448,7 @@ daemons (ResourceManager and NodeManager) are running. 51712 DataNode 51880 SecondaryNameNode -We can now view YARN information via the web interface on port 8088 (http://host1.example.com:8088). +We can now view YARN information via the web interface on port 8088 (http://host1.example.com:8088). I'll execute the `genLinearRegressionData.dml` example that we've previously considered. @@ -465,7 +463,7 @@ I'll execute the `genLinearRegressionData.dml` example that we've previously con 15/11/12 11:57:07 INFO api.DMLScript: SystemML Statistics: Total execution time: 1.265 sec. Number of executed MR Jobs: 0. - + 15/11/12 11:57:07 INFO api.DMLScript: END DML run 11/12/2015 11:57:07 If we examine the HDFS file system, we see the files generated by the execution of the DML script by SystemML on Hadoop. @@ -529,12 +527,12 @@ First, I created a hadoop user on each slave node. [root@host2 ~]# useradd hadoop [root@host2 ~]# passwd hadoop [root@host2 ~]# exit - + [root@host1 ~]$ ssh [email protected] [root@host2 ~]# useradd hadoop [root@host2 ~]# passwd hadoop [root@host2 ~]# exit - + [root@host1 ~]$ ssh [email protected] [root@host2 ~]# useradd hadoop [root@host2 ~]# passwd hadoop @@ -547,17 +545,17 @@ tested the passwordless login from the master node to each of the slave nodes fo user. 
$ ssh [email protected] - + [hadoop@host1 ~]$ ssh-copy-id host2.example.com [hadoop@host1 ~]$ ssh [email protected] Last login: Thu Nov 12 14:16:21 2015 [hadoop@host2 ~]$ exit - + [hadoop@host1 ~]$ ssh-copy-id host3.example.com [hadoop@host1 ~]$ ssh [email protected] Last login: Thu Nov 12 14:16:40 2015 [hadoop@host3 ~]$ exit - + [hadoop@host1 ~]$ ssh-copy-id host4.example.com [hadoop@host1 ~]$ ssh [email protected] Last login: Thu Nov 12 14:17:10 2015 @@ -575,7 +573,7 @@ On the master node, I specified the slave nodes in the Hadoop `slaves` configura In the `core-site.xml` file, I specified the `fs.defaultFS` property to reference the master node. [hadoop@host1 hadoop]$ more core-site.xml - + ... <configuration> <property> @@ -590,7 +588,7 @@ In the `hdfs-site.xml` configuration file, I removed the previous `dfs.replicati will use the default replication value (of 3). [hadoop@host1 hadoop]$ more hdfs-site.xml - + ... <configuration> </configuration> @@ -604,7 +602,7 @@ Furthermore, we'll set `mapreduce.map.memory.mb` and `mapreduce.reduce.memory.mb values are set to at least 1.5 times the value of the maximum heap size. [hadoop@host1 hadoop]$ more mapred-site.xml - + ... <configuration> <property> @@ -634,7 +632,7 @@ In the `yarn-site.xml` configuration file, I added a `yarn.resourcemanager.hostn the master node as the host. [hadoop@host1 hadoop]$ more yarn-site.xml - + ... <configuration> <property> @@ -652,7 +650,7 @@ In the previous example, we specified the `JAVA_HOME` in the `hadoop-env.sh` con We will use that same value. [hadoop@host1 hadoop]$ more hadoop-env.sh - + ... export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64 ... @@ -744,13 +742,13 @@ If we look at the Hadoop (on port 50070) and YARN (on port 8088) web interfaces, Let's go ahead and run the SystemML example from the GitHub README. [hadoop@host1 ~]$ hadoop jar systemml-{{site.SYSTEMML_VERSION}}/SystemML.jar -f genLinearRegressionData.dml -nvargs numSamples=1000 numFeatures=50 maxFeatureValue=5 maxWeight=5 addNoise=FALSE b=0 sparsity=0.7 output=linRegData.csv format=csv perc=0.5 - + [hadoop@host1 ~]$ hadoop jar systemml-{{site.SYSTEMML_VERSION}}/SystemML.jar -f systemml-{{site.SYSTEMML_VERSION}}/algorithms/utils/sample.dml -nvargs X=linRegData.csv sv=perc.csv O=linRegDataParts ofmt=csv - + [hadoop@host1 ~]$ hadoop jar systemml-{{site.SYSTEMML_VERSION}}/SystemML.jar -f systemml-{{site.SYSTEMML_VERSION}}/algorithms/utils/splitXY.dml -nvargs X=linRegDataParts/1 y=51 OX=linRegData.train.data.csv OY=linRegData.train.labels.csv ofmt=csv - + [hadoop@host1 ~]$ hadoop jar systemml-{{site.SYSTEMML_VERSION}}/SystemML.jar -f systemml-{{site.SYSTEMML_VERSION}}/algorithms/utils/splitXY.dml -nvargs X=linRegDataParts/2 y=51 OX=linRegData.test.data.csv OY=linRegData.test.labels.csv ofmt=csv - + [hadoop@host1 ~]$ hadoop jar systemml-{{site.SYSTEMML_VERSION}}/SystemML.jar -f systemml-{{site.SYSTEMML_VERSION}}/algorithms/LinearRegDS.dml -nvargs X=linRegData.train.data.csv Y=linRegData.train.labels.csv B=betas.csv fmt=csv ... BEGIN LINEAR REGRESSION SCRIPT @@ -773,7 +771,7 @@ Let's go ahead and run the SystemML example from the GitHub README. 15/11/17 15:50:34 INFO api.DMLScript: SystemML Statistics: Total execution time: 0.480 sec. ... - + [hadoop@host1 ~]$ hadoop jar systemml-{{site.SYSTEMML_VERSION}}/SystemML.jar -f systemml-{{site.SYSTEMML_VERSION}}/algorithms/GLM-predict.dml -nvargs X=linRegData.test.data.csv Y=linRegData.test.labels.csv B=betas.csv fmt=csv ... 
LOGLHOOD_Z,,FALSE,NaN @@ -863,7 +861,7 @@ executing SystemML in Hadoop from my home directory rather than from the SystemM [hadoop@host1 ~]$ hadoop jar systemml-{{site.SYSTEMML_VERSION}}/SystemML.jar -f genRandData4Kmeans.dml -config=systemml-{{site.SYSTEMML_VERSION}}/SystemML-config.xml -nvargs nr=1000000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=X.mtx C=C.mtx Y=Y.mtx YbyC=YbyC.mtx -After the data generation has finished, I'll check HDFS for the amount of space used. The 1M-row matrix `X.mtx` +After the data generation has finished, I'll check HDFS for the amount of space used. The 1M-row matrix `X.mtx` requires about 2.8GB of space. [hadoop@host1 ~]$ hdfs dfs -df -h @@ -895,7 +893,7 @@ Here we can see the `X.mtx` data files. -rw-r--r-- 1 hadoop supergroup 481624723 2015-11-19 11:56 X.mtx/2-r-00004 -rw-r--r-- 1 hadoop supergroup 481624048 2015-11-19 11:56 X.mtx/2-r-00005 -Next, I'll run the `Kmeans.dml` algorithm on the 1M-row matrix `X.mtx`. +Next, I'll run the `Kmeans.dml` algorithm on the 1M-row matrix `X.mtx`. [hadoop@host1 ~]$ hadoop jar systemml-{{site.SYSTEMML_VERSION}}/SystemML.jar -f systemml-{{site.SYSTEMML_VERSION}}/algorithms/Kmeans.dml -config=/systemml-{{site.SYSTEMML_VERSION}}/SystemML-config.xml -nvargs X=X.mtx k=5 C=Centroids.mtx @@ -920,7 +918,7 @@ the `Kmeans-predict.dml` script. [hadoop@host1 ~]$ hadoop jar systemml-{{site.SYSTEMML_VERSION}}/SystemML.jar -f systemml-{{site.SYSTEMML_VERSION}}/algorithms/Kmeans-predict.dml -config=systemml-{{site.SYSTEMML_VERSION}}/SystemML-config.xml -nvargs X=X.mtx C=Centroids.mtx prY=PredY.mtx O=stats.txt -In the file system, we can see that the `PredY.mtx` matrix was created. +In the file system, we can see that the `PredY.mtx` matrix was created. The `stats.txt` file lists statistics about the results. [hadoop@host1 ~]$ hdfs dfs -ls @@ -1019,6 +1017,3 @@ in a clustered environment. </tr> </tbody> </table> - - - http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5929b3b7/index.md ---------------------------------------------------------------------- diff --git a/index.md b/index.md index 56f4ed4..738e525 100644 --- a/index.md +++ b/index.md @@ -30,59 +30,58 @@ SystemML is a flexible, scalable machine learning system. SystemML's distinguishing characteristics are: 1. **Algorithm customizability via R-like and Python-like languages**. - 2. **Multiple execution modes**, including Standalone, Spark Batch, Spark MLContext, Hadoop Batch, and JMLC. + 2. **Multiple execution modes**, including Spark MLContext, Spark Batch, Hadoop Batch, Standalone, and JMLC. 3. **Automatic optimization** based on data and cluster characteristics to ensure both efficiency and scalability. The [**SystemML GitHub README**](https://github.com/apache/incubator-systemml) describes -building, testing, and running SystemML. Please read [**Contributing to SystemML**](contributing-to-systemml.html) +building, testing, and running SystemML. Please read [**Contributing to SystemML**](contributing-to-systemml) to find out how to help make SystemML even better! -To download SystemML, visit the [downloads](http://systemml.apache.org/download.html) page. +To download SystemML, visit the [downloads](http://systemml.apache.org/download) page. ## Running SystemML -* **Standalone** - Standalone mode allows data scientists to rapidly prototype algorithms on a single +* **[Spark MLContext](spark-mlcontext-programming-guide)** - Spark MLContext is a programmatic API +for running SystemML from Spark via Scala, Python, or Java. 
+ * See the [Spark MLContext Programming Guide](spark-mlcontext-programming-guide) with the + following examples: + * [**Spark Shell (Scala)**](spark-mlcontext-programming-guide#spark-shell-example---new-api) + * [**Zeppelin Notebook (Scala)**](spark-mlcontext-programming-guide#zeppelin-notebook-example---linear-regression-algorithm---old-api) + * [**Jupyter Notebook (PySpark)**](spark-mlcontext-programming-guide#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization---old-api) +* **[Spark Batch](spark-batch-mode)** - Algorithms are automatically optimized to run across Spark clusters. + * See [Invoking SystemML in Spark Batch Mode](spark-batch-mode) for detailed information. +* **[Hadoop Batch](hadoop-batch-mode)** - Algorithms are automatically optimized when distributed across Hadoop clusters. + * See [Invoking SystemML in Hadoop Batch Mode](hadoop-batch-mode) for detailed information. +* **[Standalone](standalone-guide)** - Standalone mode allows data scientists to rapidly prototype algorithms on a single machine in R-like and Python-like declarative languages. - * The [SystemML GitHub README](https://github.com/apache/incubator-systemml) describes - a linear regression example in Standalone Mode. - * The [Quick Start Guide](quick-start-guide.html) provides additional examples of algorithm execution + * The [Standalone Guide](standalone-guide) provides examples of algorithm execution in Standalone Mode. -* **Spark MLContext** - Spark MLContext is a programmatic API for running SystemML from Spark via Scala, Python, or Java. - * See the [Spark MLContext Programming Guide](spark-mlcontext-programming-guide.html) for - [**Spark Shell (Scala)**](spark-mlcontext-programming-guide.html#spark-shell-example), - [Java](spark-mlcontext-programming-guide.html#java-example), - [**Zeppelin Notebook**](spark-mlcontext-programming-guide.html#zeppelin-notebook-example---linear-regression-algorithm), - and [**Jupyter Notebook (PySpark)**](spark-mlcontext-programming-guide.html#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization) - examples. -* **Spark Batch** - Algorithms are automatically optimized to run across Spark clusters. - * See **Invoking SystemML in Spark Batch Mode** **(Coming soon)**. -* **Hadoop Batch** - Algorithms are automatically optimized when distributed across Hadoop clusters. - * See [Invoking SystemML in Hadoop Batch Mode](hadoop-batch-mode.html) for detailed information. -* **JMLC** - Java Machine Learning Connector. - * See [Java Machine Learning Connector (JMLC)](jmlc.html) for more information. +* **[JMLC](jmlc)** - Java Machine Learning Connector. + * See [Java Machine Learning Connector (JMLC)](jmlc) for more information. ## Language Guides -* [DML Language Reference](dml-language-reference.html) - +* [DML Language Reference](dml-language-reference) - DML is a high-level R-like declarative language for machine learning. * **PyDML Language Reference** **(Coming Soon)** - PyDML is a high-level Python-like declarative language for machine learning. -* [Beginner's Guide to DML and PyDML](beginners-guide-to-dml-and-pydml.html) - +* [Beginner's Guide to DML and PyDML](beginners-guide-to-dml-and-pydml) - An introduction to the basics of DML and PyDML. ## ML Algorithms -* [Algorithms Reference](algorithms-reference.html) - The Algorithms Reference describes the +* [Algorithms Reference](algorithms-reference) - The Algorithms Reference describes the machine learning algorithms included with SystemML in detail. 
## Tools -* [Debugger Guide](debugger-guide.html) - SystemML supports DML script-level debugging through a +* [Debugger Guide](debugger-guide) - SystemML supports DML script-level debugging through a command-line interface. -* [IDE Guide](developer-tools-systemml.html) - Useful IDE Guide for Developing SystemML. +* [IDE Guide](developer-tools-systemml) - Useful IDE Guide for Developing SystemML. ## Other -* [Contributing to SystemML](contributing-to-systemml.html) - Describes ways to contribute to SystemML. -* [Troubleshooting Guide](troubleshooting-guide.html) - Troubleshoot various issues related to SystemML. +* [Contributing to SystemML](contributing-to-systemml) - Describes ways to contribute to SystemML. +* [Engine Developer Guide](engine-dev-guide) - Guide for internal SystemML engine development. +* [Troubleshooting Guide](troubleshooting-guide) - Troubleshoot various issues related to SystemML. http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5929b3b7/quick-start-guide.md ---------------------------------------------------------------------- diff --git a/quick-start-guide.md b/quick-start-guide.md deleted file mode 100644 index f05db25..0000000 --- a/quick-start-guide.md +++ /dev/null @@ -1,399 +0,0 @@ ---- -layout: global -title: SystemML Quick Start Guide -description: SystemML Quick Start Guide -displayTitle: SystemML Quick Start Guide ---- -<!-- -{% comment %} -Licensed to the Apache Software Foundation (ASF) under one or more -contributor license agreements. See the NOTICE file distributed with -this work for additional information regarding copyright ownership. -The ASF licenses this file to you under the Apache License, Version 2.0 -(the "License"); you may not use this file except in compliance with -the License. You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. -{% endcomment %} ---> - -* This will become a table of contents (this text will be scraped). -{:toc} - -<br/> - -This tutorial provides a quick introduction to using SystemML by -running existing SystemML algorithms in standalone mode. - - -# What is SystemML - -SystemML enables large-scale machine learning (ML) via a high-level declarative -language with R-like syntax called [DML](dml-language-reference.html) and -Python-like syntax called PyDML. DML and PyDML allow data scientists to -express their ML algorithms with full flexibility but without the need to fine-tune -distributed runtime execution plans and system configurations. -These ML programs are dynamically compiled and optimized based on data -and cluster characteristics using rule-based and cost-based optimization techniques. -The compiler automatically generates hybrid runtime execution plans ranging -from in-memory, single node execution to distributed computation for Hadoop -or Spark Batch execution. -SystemML features a suite of algorithms for Descriptive Statistics, Classification, -Clustering, Regression, Matrix Factorization, and Survival Analysis. Detailed descriptions of these -algorithms can be found in the [Algorithms Reference](algorithms-reference.html). 
- -<br/> - -# Download SystemML - -Apache incubator releases of SystemML are available from the [downloads](http://systemml.apache.org/download.html) page. - -The SystemML project is available on GitHub at [https://github.com/apache/incubator-systemml](https://github.com/apache/incubator-systemml). -SystemML can be downloaded from GitHub and built with Maven. Instructions to build and -test SystemML can be found in the [SystemML GitHub README](https://github.com/apache/incubator-systemml). - -<br/> - -# Standalone vs Distributed Execution Mode - -SystemML's standalone mode is designed to allow data scientists to rapidly prototype algorithms -on a single machine. The standalone release packages all required libraries into a single distribution file. -In standalone mode, all operations occur on a single node in a non-Hadoop environment. Standalone mode -is not appropriate for large datasets. - -For large-scale production environments, SystemML algorithm execution can be -distributed across multi-node clusters using [Apache Hadoop](https://hadoop.apache.org/) -or [Apache Spark](http://spark.apache.org/). -We will make use of standalone mode throughout this tutorial. - -<br/> - -# Contents of the SystemML Standalone Package - -To follow along with this guide, first build a standalone package of SystemML -using [Apache Maven](http://maven.apache.org) -and unpack it. - - $ git clone https://github.com/apache/incubator-systemml.git - $ cd incubator-systemml - $ mvn clean package -P distribution - $ tar -xvzf target/systemml-*-standalone.tar.gz -C .. - $ cd .. - -The extracted package should have these contents: - - $ ls -lF systemml-{{site.SYSTEMML_VERSION}}/ - total 96 - -rw-r--r-- LICENSE - -rw-r--r-- NOTICE - -rw-r--r-- SystemML-config.xml - drwxr-xr-x docs/ - drwxr-xr-x lib/ - -rw-r--r-- log4j.properties - -rw-r--r-- readme.txt - -rwxr-xr-x runStandaloneSystemML.bat* - -rwxr-xr-x runStandaloneSystemML.sh* - drwxr-xr-x scripts/ - -For the rest of the tutorial we will switch to the `systemml-{{site.SYSTEMML_VERSION}}` directory. - - $ cd ~/systemml-{{site.SYSTEMML_VERSION}} - -Note that standalone mode supports both Mac/UNIX and Windows. To run the following examples on -Windows, the "`./runStandaloneSystemML.sh ...`" commands can be replaced with -"`./runStandaloneSystemML.bat ...`" commands. - -<br/> - -# Choosing Test Data - -In this tutorial we will use the [Haberman's Survival Data Set](http://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) -which can be downloaded in CSV format from the [Center for Machine Learning and Intelligent Systems](http://cml.ics.uci.edu/) - - $ wget -P data/ http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data - -The [Haberman Data Set](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.names) -has 306 instances and 4 attributes (including the class attribute): - - 1. Age of patient at time of operation (numerical) - 2. Patient's year of operation (year - 1900, numerical) - 3. Number of positive axillary nodes detected (numerical) - 4. Survival status (class attribute) - * `1` = the patient survived 5 years or longer - * `2` = the patient died within 5 year - - -We will need to create a metadata file (MTD) which stores metadata information -about the content of the data file. The name of the MTD file associated with the -data file `<filename>` must be `<filename>.mtd`. 
- - $ echo '{"rows": 306, "cols": 4, "format": "csv"}' > data/haberman.data.mtd - -<br/> - -# Example 1 - Univariate Statistics - -Let's start with a simple example, computing certain [univariate statistics](algorithms-descriptive-statistics.html#univariate-statistics) -for each feature column using the algorithm `Univar-Stats.dml` which requires 3 -[arguments](algorithms-descriptive-statistics.html#arguments): - -* `X`: location of the input data file to analyze -* `TYPES`: location of the file that contains the feature column types encoded by integer numbers: `1` = scale, `2` = nominal, `3` = ordinal -* `STATS`: location where the output matrix of computed statistics is to be stored - -We need to create a file `types.csv` that describes the type of each column in -the data along with its metadata file `types.csv.mtd`. - - $ echo '1,1,1,2' > data/types.csv - $ echo '{"rows": 1, "cols": 4, "format": "csv"}' > data/types.csv.mtd - - -To run the `Univar-Stats.dml` algorithm, issue the following command (we set the optional argument `CONSOLE_OUTPUT` to `TRUE` to print the statistics to the console): - - $ ./runStandaloneSystemML.sh scripts/algorithms/Univar-Stats.dml -nvargs X=data/haberman.data TYPES=data/types.csv STATS=data/univarOut.mtx CONSOLE_OUTPUT=TRUE - - [...] - ------------------------------------------------- - Feature [1]: Scale - (01) Minimum | 30.0 - (02) Maximum | 83.0 - (03) Range | 53.0 - (04) Mean | 52.45751633986928 - (05) Variance | 116.71458266366658 - (06) Std deviation | 10.803452349303281 - (07) Std err of mean | 0.6175922641866753 - (08) Coeff of variation | 0.20594669940735139 - (09) Skewness | 0.1450718616532357 - (10) Kurtosis | -0.6150152487211726 - (11) Std err of skewness | 0.13934809593495995 - (12) Std err of kurtosis | 0.277810485320835 - (13) Median | 52.0 - (14) Interquartile mean | 52.16013071895425 - ------------------------------------------------- - Feature [2]: Scale - (01) Minimum | 58.0 - (02) Maximum | 69.0 - (03) Range | 11.0 - (04) Mean | 62.85294117647059 - (05) Variance | 10.558630665380907 - (06) Std deviation | 3.2494046632238507 - (07) Std err of mean | 0.18575610076612029 - (08) Coeff of variation | 0.051698529971741194 - (09) Skewness | 0.07798443581479181 - (10) Kurtosis | -1.1324380182967442 - (11) Std err of skewness | 0.13934809593495995 - (12) Std err of kurtosis | 0.277810485320835 - (13) Median | 63.0 - (14) Interquartile mean | 62.80392156862745 - ------------------------------------------------- - Feature [3]: Scale - (01) Minimum | 0.0 - (02) Maximum | 52.0 - (03) Range | 52.0 - (04) Mean | 4.026143790849673 - (05) Variance | 51.691117539912135 - (06) Std deviation | 7.189653506248555 - (07) Std err of mean | 0.41100513466216837 - (08) Coeff of variation | 1.7857418611299172 - (09) Skewness | 2.954633471088322 - (10) Kurtosis | 11.425776549251449 - (11) Std err of skewness | 0.13934809593495995 - (12) Std err of kurtosis | 0.277810485320835 - (13) Median | 1.0 - (14) Interquartile mean | 1.2483660130718954 - ------------------------------------------------- - Feature [4]: Categorical (Nominal) - (15) Num of categories | 2 - (16) Mode | 1 - (17) Num of modes | 1 - - -The `Univar-Stats.dml` script writes the computed statistics to the `univarOut.mtx` file. The matrix has one row per univariate statistic and one column per input feature. 
The first column gives the number of the statistic -(see above table), the second column gives the number of the feature column in -the input data, and the third column gives the value of the univariate statistic. - - 1 1 30.0 - 1 2 58.0 - 2 1 83.0 - 2 2 69.0 - 2 3 52.0 - 3 1 53.0 - 3 2 11.0 - 3 3 52.0 - 4 1 52.45751633986928 - 4 2 62.85294117647059 - 4 3 4.026143790849673 - 5 1 116.71458266366658 - 5 2 10.558630665380907 - 5 3 51.691117539912135 - 6 1 10.803452349303281 - 6 2 3.2494046632238507 - 6 3 7.189653506248555 - 7 1 0.6175922641866753 - 7 2 0.18575610076612029 - 7 3 0.41100513466216837 - 8 1 0.20594669940735139 - 8 2 0.051698529971741194 - 8 3 1.7857418611299172 - 9 1 0.1450718616532357 - 9 2 0.07798443581479181 - 9 3 2.954633471088322 - 10 1 -0.6150152487211726 - 10 2 -1.1324380182967442 - 10 3 11.425776549251449 - 11 1 0.13934809593495995 - 11 2 0.13934809593495995 - 11 3 0.13934809593495995 - 12 1 0.277810485320835 - 12 2 0.277810485320835 - 12 3 0.277810485320835 - 13 1 52.0 - 13 2 63.0 - 13 3 1.0 - 14 1 52.16013071895425 - 14 2 62.80392156862745 - 14 3 1.2483660130718954 - 15 4 2.0 - 16 4 1.0 - 17 4 1.0 - - -<br/> -<br/> - -# Example 2 - Binary-class Support Vector Machines - -Let's take the same `haberman.data` to explore the -[binary-class support vector machines](algorithms-classification.html#binary-class-support-vector-machines) algorithm `l2-svm.dml`. -This example also illustrates how to use of the sampling algorithm `sample.dml` -and the data split algorithm `spliXY.dml`. - -## Sampling the Test Data - -First we need to use the `sample.dml` algorithm to separate the input into one -training data set and one data set for model prediction. - -Parameters: - - * `X` : (input) input data set: filename of input data set - * `sv` : (input) sampling vector: filename of 1-column vector w/ percentages. sum(sv) must be 1. - * `O` : (output) folder name w/ samples generated - * `ofmt` : (output) format of O: "csv", "binary" (default) - - -We will create the file `perc.csv` and `perc.csv.mtd` to define the sampling vector with a sampling rate of -50% to generate 2 data sets: - - $ printf "0.5\n0.5" > data/perc.csv - $ echo '{"rows": 2, "cols": 1, "format": "csv"}' > data/perc.csv.mtd - -Let's run the sampling algorithm to create the two data samples: - - $ ./runStandaloneSystemML.sh scripts/utils/sample.dml -nvargs X=data/haberman.data sv=data/perc.csv O=data/haberman.part ofmt="csv" - - -## Splitting Labels from Features - -Next we use the `splitXY.dml` algorithm to separate the feature columns from -the label column(s). - -Parameters: - - * `X` : (input) filename of data matrix - * `y` : (input) colIndex: starting index is 1 - * `OX` : (output) filename of output matrix with all columns except y - * `OY` : (output) filename of output matrix with y column - * `ofmt` : (output) format of OX and OY output matrix: "csv", "binary" (default) - -We specify `y=4` as the 4th column contains the labels to be predicted and run -the `splitXY.dml` algorithm on our training and test data sets. - - $ ./runStandaloneSystemML.sh scripts/utils/splitXY.dml -nvargs X=data/haberman.part/1 y=4 OX=data/haberman.train.data.csv OY=data/haberman.train.labels.csv ofmt="csv" - - $ ./runStandaloneSystemML.sh scripts/utils/splitXY.dml -nvargs X=data/haberman.part/2 y=4 OX=data/haberman.test.data.csv OY=data/haberman.test.labels.csv ofmt="csv" - -## Training and Testing the Model - -Now we need to train our model using the `l2-svm.dml` algorithm. 
- -[Parameters](algorithms-classification.html#arguments-1): - - * `X` : (input) filename of training data features - * `Y` : (input) filename of training data labels - * `model` : (output) filename of model that contains the learnt weights - * `fmt` : (output) format of model: "csv", "text" (sparse-matrix) - * `Log` : (output) log file for metrics and progress while training - * `confusion` : (output) filename of confusion matrix computed using a held-out test set (optional) - -The `l2-svm.dml` algorithm is used on our training data sample to train the model. - - $ ./runStandaloneSystemML.sh scripts/algorithms/l2-svm.dml -nvargs X=data/haberman.train.data.csv Y=data/haberman.train.labels.csv model=data/l2-svm-model.csv fmt="csv" Log=data/l2-svm-log.csv - -The `l2-svm-predict.dml` algorithm is used on our test data sample to predict the labels based on the trained model. - - $ ./runStandaloneSystemML.sh scripts/algorithms/l2-svm-predict.dml -nvargs X=data/haberman.test.data.csv Y=data/haberman.test.labels.csv model=data/l2-svm-model.csv fmt="csv" confusion=data/l2-svm-confusion.csv - -The console output should show the accuracy of the trained model in percent, i.e.: - - 15/09/01 01:32:51 INFO api.DMLScript: BEGIN DML run 09/01/2015 01:32:51 - 15/09/01 01:32:51 INFO conf.DMLConfig: Updating localtmpdir with value /tmp/systemml - 15/09/01 01:32:51 INFO conf.DMLConfig: Updating scratch with value scratch_space - 15/09/01 01:32:51 INFO conf.DMLConfig: Updating optlevel with value 2 - 15/09/01 01:32:51 INFO conf.DMLConfig: Updating numreducers with value 10 - 15/09/01 01:32:51 INFO conf.DMLConfig: Updating jvmreuse with value false - 15/09/01 01:32:51 INFO conf.DMLConfig: Updating defaultblocksize with value 1000 - 15/09/01 01:32:51 INFO conf.DMLConfig: Updating dml.yarn.appmaster with value false - 15/09/01 01:32:51 INFO conf.DMLConfig: Updating dml.yarn.appmaster.mem with value 2048 - 15/09/01 01:32:51 INFO conf.DMLConfig: Updating dml.yarn.mapreduce.mem with value 2048 - 15/09/01 01:32:51 INFO conf.DMLConfig: Updating dml.yarn.app.queue with value default - 15/09/01 01:32:51 INFO conf.DMLConfig: Updating cp.parallel.matrixmult with value true - 15/09/01 01:32:51 INFO conf.DMLConfig: Updating cp.parallel.textio with value true - Accuracy (%): 74.14965986394557 - 15/09/01 01:32:52 INFO api.DMLScript: SystemML Statistics: - Total execution time: 0.130 sec. - Number of executed MR Jobs: 0. - -The generated file `l2-svm-confusion.csv` should contain the following confusion matrix of this form: - - |0 1.0 2.0| - |1.0 t1 t2 | - |2.0 t3 t4 | - - * The model correctly predicted label 1 `t1` times - * The model incorrectly predicted label 1 as opposed to label 2 `t2` times - * The model incorrectly predicted label 2 as opposed to label 1 `t3` times - * The model correctly predicted label 2 `t4` times. - -If the confusion matrix looks like this ... - - 0,1.0,2.0 - 1.0,107.0,38.0 - 2.0,0.0,2.0 - -... then the accuracy of the model is (t1+t4)/(t1+t2+t3+t4) = (107+2)/107+38+0+2) = 0.741496599 - -<br/> - -Refer to the [Algorithms Reference](algorithms-reference.html) for more details. 
- -<br/> - -# Troubleshooting - -If you encounter a `"java.lang.OutOfMemoryError"` you can edit the invocation -script (`runStandaloneSystemML.sh` or `runStandaloneSystemML.bat`) to increase -the memory available to the JVM, i.e: - - java -Xmx16g -Xms4g -Xmn1g -cp ${CLASSPATH} org.apache.sysml.api.DMLScript \ - -f ${SCRIPT_FILE} -exec singlenode -config=SystemML-config.xml \ - $@ - -<br/> - -`this is code` \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5929b3b7/release-process.md ---------------------------------------------------------------------- diff --git a/release-process.md b/release-process.md index ded33e9..ed78a32 100644 --- a/release-process.md +++ b/release-process.md @@ -84,23 +84,23 @@ this OS X example. # download artifacts wget -r -nH -nd -np -R index.html* https://dist.apache.org/repos/dist/dev/incubator/systemml/0.10.0-incubating-rc1/ - + # verify standalone tar.gz works tar -xvzf systemml-0.10.0-incubating-standalone.tar.gz cd systemml-0.10.0-incubating-standalone echo "print('hello world');" > hello.dml ./runStandaloneSystemML.sh hello.dml cd .. - + # verify main jar works mkdir lib cp -R systemml-0.10.0-incubating-standalone/lib/* lib/ rm lib/systemml-0.10.0-incubating.jar java -cp ./lib/*:systemml-0.10.0-incubating.jar org.apache.sysml.api.DMLScript -s "print('hello world');" - + # verify standalone jar works java -jar systemml-0.10.0-incubating-standalone.jar -s "print('hello world');" - + # verify src works tar -xvzf systemml-0.10.0-incubating-src.tar.gz cd systemml-0.10.0-incubating-src @@ -111,21 +111,21 @@ this OS X example. java -jar systemml-0.10.0-incubating-standalone.jar -s "print('hello world');" cd .. cd .. - + # verify in-memory jar works echo "import org.apache.sysml.api.jmlc.*;public class JMLCEx {public static void main(String[] args) throws Exception {Connection conn = new Connection();PreparedScript script = conn.prepareScript(\"print('hello world');\", new String[]{}, new String[]{}, false);script.executeScript();}}" > JMLCEx.java javac -cp systemml-0.10.0-incubating-inmemory.jar JMLCEx.java java -cp .:systemml-0.10.0-incubating-inmemory.jar JMLCEx - + # verify distrib tar.gz works tar -xvzf systemml-0.10.0-incubating.tar.gz cd systemml-0.10.0-incubating java -cp ../lib/*:SystemML.jar org.apache.sysml.api.DMLScript -s "print('hello world');" - + # verify spark batch mode export SPARK_HOME=/Users/deroneriksson/spark-1.5.1-bin-hadoop2.6 $SPARK_HOME/bin/spark-submit SystemML.jar -s "print('hello world');" -exec hybrid_spark - + # verify hadoop batch mode hadoop jar SystemML.jar -s "print('hello world');" @@ -135,18 +135,18 @@ sanity check on OS X after building the artifacts manually. # build distribution artifacts mvn clean package -P distribution - + cd target - + # verify main jar works java -cp ./lib/*:systemml-0.10.0-incubating.jar org.apache.sysml.api.DMLScript -s "print('hello world');" - + # verify SystemML.jar works java -cp ./lib/*:SystemML.jar org.apache.sysml.api.DMLScript -s "print('hello world');" - + # verify standalone jar works java -jar systemml-0.10.0-incubating-standalone.jar -s "print('hello world');" - + # verify src works tar -xvzf systemml-0.10.0-incubating-src.tar.gz cd systemml-0.10.0-incubating-src @@ -157,28 +157,28 @@ sanity check on OS X after building the artifacts manually. java -jar systemml-0.10.0-incubating-standalone.jar -s "print('hello world');" cd .. cd .. 
- + # verify in-memory jar works echo "import org.apache.sysml.api.jmlc.*;public class JMLCEx {public static void main(String[] args) throws Exception {Connection conn = new Connection();PreparedScript script = conn.prepareScript(\"print('hello world');\", new String[]{}, new String[]{}, false);script.executeScript();}}" > JMLCEx.java javac -cp systemml-0.10.0-incubating-inmemory.jar JMLCEx.java java -cp .:systemml-0.10.0-incubating-inmemory.jar JMLCEx - + # verify standalone tar.gz works tar -xvzf systemml-0.10.0-incubating-standalone.tar.gz cd systemml-0.10.0-incubating-standalone echo "print('hello world');" > hello.dml ./runStandaloneSystemML.sh hello.dml cd .. - + # verify distrib tar.gz works tar -xvzf systemml-0.10.0-incubating.tar.gz cd systemml-0.10.0-incubating java -cp ../lib/*:SystemML.jar org.apache.sysml.api.DMLScript -s "print('hello world');" - + # verify spark batch mode export SPARK_HOME=/Users/deroneriksson/spark-1.5.1-bin-hadoop2.6 $SPARK_HOME/bin/spark-submit SystemML.jar -s "print('hello world');" -exec hybrid_spark - + # verify hadoop batch mode hadoop jar SystemML.jar -s "print('hello world');" @@ -222,7 +222,7 @@ The standalone tar.gz and zip artifacts contain `runStandaloneSystemML.sh` and ` files. Verify that one or more algorithms can be run on a single node using these standalone distributions. -Here is an example based on the [Quick Start Guide](http://apache.github.io/incubator-systemml/quick-start-guide.html) +Here is an example based on the [Standalone Guide](http://apache.github.io/incubator-systemml/standalone-guide.html) demonstrating the execution of an algorithm (on OS X). $ tar -xvzf systemml-0.10.0-incubating-standalone.tar.gz @@ -276,7 +276,3 @@ For examples, see the [Spark MLContext Programming Guide](http://apache.github.i Verify that the performance suite located at scripts/perftest/ executes on Spark and Hadoop. Testing should include 80MB, 800MB, 8GB, and 80GB data sizes. - - - - http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5929b3b7/spark-batch-mode.md ---------------------------------------------------------------------- diff --git a/spark-batch-mode.md b/spark-batch-mode.md new file mode 100644 index 0000000..c199b1f --- /dev/null +++ b/spark-batch-mode.md @@ -0,0 +1,84 @@ +--- +layout: global +title: Invoking SystemML in Spark Batch Mode +description: Invoking SystemML in Spark Batch Mode +--- +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +* This will become a table of contents (this text will be scraped). +{:toc} + +<br/> + + +# Overview + +Given that a primary purpose of SystemML is to perform machine learning on large distributed data +sets, one of the most important ways to invoke SystemML is Spark Batch. Here, we will look at this +mode in more depth. 
+ +**NOTE:** For a programmatic API to run and interact with SystemML via Scala or Python, please see the +[Spark MLContext Programming Guide](spark-mlcontext-programming-guide.html). + +--- + +# Spark Batch Mode Invocation Syntax + +SystemML can be invoked in Spark batch mode using the following syntax: + + spark-submit SystemML.jar [-? | -help | -f <filename>] (-config=<config_filename>) ([-args | -nvargs] <args-list>) + +The DML script to invoke is specified after the `-f` argument. Configuration settings can be passed to SystemML +using the optional `-config=` argument. DML scripts can optionally take named arguments (`-nvargs`) or positional +arguments (`-args`). Named arguments are preferred over positional arguments. Positional arguments are considered +to be deprecated. All the primary algorithm scripts included with SystemML use named arguments. + + +**Example #1: DML Invocation with Named Arguments** + + spark-submit systemml/SystemML.jar -f systemml/algorithms/Kmeans.dml -nvargs X=X.mtx k=5 + + +**Example #2: DML Invocation with Positional Arguments** + + spark-submit systemml/SystemML.jar -f example/test/LinearRegression.dml -args "v" "y" 0.00000001 "w" + +# Execution Modes + +SystemML works seamlessly with all Spark execution modes, including *local* (`--master local[*]`), +*yarn client* (`--master yarn-client`), *yarn cluster* (`--master yarn-cluster`), *etc*. More +information on Spark cluster execution modes can be found in the +[official Spark cluster deployment documentation](https://spark.apache.org/docs/latest/cluster-overview.html). +*Note* that Spark can easily be run on a laptop in local mode using the `--master local[*]` option described +above, which SystemML supports. + +# Recommended Spark Configuration Settings + +For best performance, we recommend setting the following flags when running SystemML with Spark: +`--conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128`.
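To tie the sections above together, here is a minimal sketch of a complete invocation that combines an execution mode, the recommended configuration flags, and named arguments. It reuses the Kmeans paths from Example #1, so the jar, script, and data locations are illustrative and should be adjusted to your environment.

    # run the Kmeans script on YARN (client mode) with the recommended SystemML flags
    spark-submit --master yarn-client \
        --conf spark.driver.maxResultSize=0 \
        --conf spark.akka.frameSize=128 \
        systemml/SystemML.jar \
        -f systemml/algorithms/Kmeans.dml -nvargs X=X.mtx k=5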
+ +# Examples + +Please see the MNIST examples in the included +[SystemML-NN](https://github.com/apache/incubator-systemml/tree/master/scripts/staging/SystemML-NN) +library for examples of Spark Batch mode execution with SystemML to train MNIST classifiers: + + * [MNIST Softmax Classifier](https://github.com/apache/incubator-systemml/blob/master/scripts/staging/SystemML-NN/examples/mnist_softmax-train.dml) + * [MNIST LeNet ConvNet](https://github.com/apache/incubator-systemml/blob/master/scripts/staging/SystemML-NN/examples/mnist_lenet-train.dml) http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5929b3b7/spark-mlcontext-programming-guide.md ---------------------------------------------------------------------- diff --git a/spark-mlcontext-programming-guide.md b/spark-mlcontext-programming-guide.md index 2eaf1be..6c2d2af 100644 --- a/spark-mlcontext-programming-guide.md +++ b/spark-mlcontext-programming-guide.md @@ -108,7 +108,7 @@ ml.execute(helloScript) <div data-lang="Spark Shell" markdown="1"> {% highlight scala %} scala> val helloScript = dml("print('hello world')") -helloScript: org.apache.sysml.api.mlcontext.Script = +helloScript: org.apache.sysml.api.mlcontext.Script = Inputs: None @@ -117,7 +117,7 @@ None scala> ml.execute(helloScript) hello world -res0: org.apache.sysml.api.mlcontext.MLResults = +res0: org.apache.sysml.api.mlcontext.MLResults = None {% endhighlight %} @@ -214,7 +214,7 @@ scala> val minMaxMean = | maxOut = max(Xin) | meanOut = mean(Xin) | """ -minMaxMean: String = +minMaxMean: String = " minOut = min(Xin) maxOut = max(Xin) @@ -307,7 +307,7 @@ scala> val sums = """ | message = "s1 and s2 are equal" | } | """ -sums: String = +sums: String = " s1 = sum(m1); s2 = sum(m2); @@ -323,7 +323,7 @@ if (s1 > s2) { scala> scala.tools.nsc.io.File("sums.dml").writeAll(sums) scala> val sumScript = dmlFromFile("sums.dml").in(Map("m1"-> rdd1, "m2"-> rdd2)).out("s1", "s2", "message") -sumScript: org.apache.sysml.api.mlcontext.Script = +sumScript: org.apache.sysml.api.mlcontext.Script = Inputs: [1] (RDD) m1: ParallelCollectionRDD[42] at parallelize at <console>:38 [2] (RDD) m2: ParallelCollectionRDD[43] at parallelize at <console>:38 @@ -334,7 +334,7 @@ Outputs: [3] message scala> val sumResults = ml.execute(sumScript) -sumResults: org.apache.sysml.api.mlcontext.MLResults = +sumResults: org.apache.sysml.api.mlcontext.MLResults = [1] (Double) s1: 10.0 [2] (Double) s2: 26.0 [3] (String) message: s2 is greater @@ -378,7 +378,7 @@ scala> val rdd2Metadata = new MatrixMetadata(2, 2) rdd2Metadata: org.apache.sysml.api.mlcontext.MatrixMetadata = rows: 2, columns: 2, non-zeros: None, rows per block: None, columns per block: None scala> val sumScript = dmlFromFile("sums.dml").in(Seq(("m1", rdd1, rdd1Metadata), ("m2", rdd2, rdd2Metadata))).out("s1", "s2", "message") -sumScript: org.apache.sysml.api.mlcontext.Script = +sumScript: org.apache.sysml.api.mlcontext.Script = Inputs: [1] (RDD) m1: ParallelCollectionRDD[42] at parallelize at <console>:38 [2] (RDD) m2: ParallelCollectionRDD[43] at parallelize at <console>:38 @@ -416,7 +416,7 @@ val (firstSum, secondSum, sumMessage) = ml.execute(sumScript).getTuple[Double, D <div data-lang="Spark Shell" markdown="1"> {% highlight scala %} scala> val sumScript = dmlFromFile("sums.dml").in("m1", rdd1, rdd1Metadata).in("m2", rdd2, rdd2Metadata).out("s1").out("s2").out("message") -sumScript: org.apache.sysml.api.mlcontext.Script = +sumScript: org.apache.sysml.api.mlcontext.Script = Inputs: [1] (RDD) m1: ParallelCollectionRDD[42] at parallelize at 
<console>:38 [2] (RDD) m2: ParallelCollectionRDD[43] at parallelize at <console>:38 @@ -445,7 +445,7 @@ Let's look at an example of reading a matrix out of SystemML. We'll create a DML in which we create a 2x2 matrix `m`. We'll set the variable `n` to be the sum of the cells in the matrix. We create a script object using String `s`, and we set `m` and `n` as the outputs. We execute the script, and in -the results we see we have Matrix `m` and Double `n`. The `n` output variable has a value of `110.0`. +the results we see we have Matrix `m` and Double `n`. The `n` output variable has a value of `110.0`. We get Matrix `m` and Double `n` as a Tuple of values `x` and `y`. We then convert Matrix `m` to an RDD of IJV values, an RDD of CSV values, a DataFrame, and a two-dimensional Double Array, and we display @@ -478,14 +478,14 @@ scala> val s = | m = matrix("11 22 33 44", rows=2, cols=2) | n = sum(m) | """ -s: String = +s: String = " m = matrix("11 22 33 44", rows=2, cols=2) n = sum(m) " scala> val scr = dml(s).out("m", "n"); -scr: org.apache.sysml.api.mlcontext.Script = +scr: org.apache.sysml.api.mlcontext.Script = Inputs: None @@ -495,7 +495,7 @@ Outputs: scala> val res = ml.execute(scr) -res: org.apache.sysml.api.mlcontext.MLResults = +res: org.apache.sysml.api.mlcontext.MLResults = [1] (Matrix) m: Matrix: scratch_space//_p12059_9.31.117.12//_t0/temp26_14, [2 x 2, nnz=4, blocks (1000 x 1000)], binaryblock, dirty [2] (Double) n: 110.0 @@ -588,7 +588,7 @@ scala> val scriptUrl = "https://raw.githubusercontent.com/apache/incubator-syste scriptUrl: String = https://raw.githubusercontent.com/apache/incubator-systemml/master/scripts/algorithms/Univar-Stats.dml scala> val uni = dmlFromUrl(scriptUrl).in("A", habermanRDD, habermanMetadata).in("K", typesRDD, typesMetadata).in("$CONSOLE_OUTPUT", true) -uni: org.apache.sysml.api.mlcontext.Script = +uni: org.apache.sysml.api.mlcontext.Script = Inputs: [1] (RDD) A: ParallelCollectionRDD[159] at parallelize at <console>:43 [2] (RDD) K: ParallelCollectionRDD[160] at parallelize at <console>:39 @@ -653,7 +653,7 @@ Feature [4]: Categorical (Nominal) (15) Num of categories | 2 (16) Mode | 1 (17) Num of modes | 1 -res23: org.apache.sysml.api.mlcontext.MLResults = +res23: org.apache.sysml.api.mlcontext.MLResults = None {% endhighlight %} @@ -723,7 +723,7 @@ baseStats.asRDDStringIJV.collect.slice(0,9).foreach(println) <div data-lang="Spark Shell" markdown="1"> {% highlight scala %} scala> val uni = dmlFromUrl(scriptUrl).in("A", habermanRDD, habermanMetadata).in("K", typesRDD, typesMetadata).out("baseStats") -uni: org.apache.sysml.api.mlcontext.Script = +uni: org.apache.sysml.api.mlcontext.Script = Inputs: [1] (RDD) A: ParallelCollectionRDD[159] at parallelize at <console>:43 [2] (RDD) K: ParallelCollectionRDD[160] at parallelize at <console>:39 @@ -783,7 +783,7 @@ scala> val minMaxMean = | maxOut = max(Xin) | meanOut = mean(Xin) | """ -minMaxMean: String = +minMaxMean: String = " minOut = min(Xin) maxOut = max(Xin) @@ -937,7 +937,7 @@ scala> val minMaxMean = | maxOut = max(Xin) | meanOut = mean(Xin) | """ -minMaxMean: String = +minMaxMean: String = " minOut = min(Xin) maxOut = max(Xin) @@ -1023,7 +1023,7 @@ scala> val minMaxMean = | maxOut = max(Xin) | meanOut = mean(Xin) | """ -minMaxMean: String = +minMaxMean: String = " minOut = min(Xin) maxOut = max(Xin) @@ -1186,7 +1186,7 @@ scala> class MyScriptExecutor extends org.apache.sysml.api.mlcontext.ScriptExecu defined class MyScriptExecutor scala> val helloScript = dml("print('hello world')") -helloScript: 
org.apache.sysml.api.mlcontext.Script = +helloScript: org.apache.sysml.api.mlcontext.Script = Inputs: None @@ -1197,7 +1197,7 @@ scala> ml.execute(helloScript, new MyScriptExecutor) Parsing script Validating script hello world -res63: org.apache.sysml.api.mlcontext.MLResults = +res63: org.apache.sysml.api.mlcontext.MLResults = None {% endhighlight %} @@ -1242,7 +1242,7 @@ scala> val rddCSV = sc.parallelize(Array("1.0,2.0", "3.0,4.0")) rddCSV: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[190] at parallelize at <console>:38 scala> val sumAndMean = dml("sum = sum(m); mean = mean(m)").in("m", rddCSV).out("sum", "mean") -sumAndMean: org.apache.sysml.api.mlcontext.Script = +sumAndMean: org.apache.sysml.api.mlcontext.Script = Inputs: [1] (RDD) m: ParallelCollectionRDD[190] at parallelize at <console>:38 @@ -1251,7 +1251,7 @@ Outputs: [2] mean scala> ml.execute(sumAndMean) -res20: org.apache.sysml.api.mlcontext.MLResults = +res20: org.apache.sysml.api.mlcontext.MLResults = [1] (Double) sum: 10.0 [2] (Double) mean: 2.5 @@ -1291,7 +1291,7 @@ scala> val mm3x3 = new MatrixMetadata(MatrixFormat.IJV, 3, 3) mm3x3: org.apache.sysml.api.mlcontext.MatrixMetadata = rows: 3, columns: 3, non-zeros: None, rows per block: None, columns per block: None scala> val sumAndMean = dml("sum = sum(m); mean = mean(m)").in("m", rddIJV, mm3x3).out("sum", "mean") -sumAndMean: org.apache.sysml.api.mlcontext.Script = +sumAndMean: org.apache.sysml.api.mlcontext.Script = Inputs: [1] (RDD) m: ParallelCollectionRDD[202] at parallelize at <console>:38 @@ -1300,7 +1300,7 @@ Outputs: [2] mean scala> ml.execute(sumAndMean) -res21: org.apache.sysml.api.mlcontext.MLResults = +res21: org.apache.sysml.api.mlcontext.MLResults = [1] (Double) sum: 10.0 [2] (Double) mean: 1.1111111111111112 @@ -1333,7 +1333,7 @@ scala> val mm4x4 = new MatrixMetadata(MatrixFormat.IJV, 4, 4) mm4x4: org.apache.sysml.api.mlcontext.MatrixMetadata = rows: 4, columns: 4, non-zeros: None, rows per block: None, columns per block: None scala> val sumAndMean = dml("sum = sum(m); mean = mean(m)").in("m", rddIJV, mm4x4).out("sum", "mean") -sumAndMean: org.apache.sysml.api.mlcontext.Script = +sumAndMean: org.apache.sysml.api.mlcontext.Script = Inputs: [1] (RDD) m: ParallelCollectionRDD[210] at parallelize at <console>:38 @@ -1342,7 +1342,7 @@ Outputs: [2] mean scala> ml.execute(sumAndMean) -res22: org.apache.sysml.api.mlcontext.MLResults = +res22: org.apache.sysml.api.mlcontext.MLResults = [1] (Double) sum: 10.0 [2] (Double) mean: 0.625 @@ -1445,7 +1445,7 @@ scala> val rddCSV = sc.parallelize(Array("1.0,2.0", "3.0,4.0")) rddCSV: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[341] at parallelize at <console>:53 scala> val add = dml("y = x + 1").in("x", rddCSV).out("y") -add: org.apache.sysml.api.mlcontext.Script = +add: org.apache.sysml.api.mlcontext.Script = Inputs: [1] (RDD) x: ParallelCollectionRDD[341] at parallelize at <console>:53 @@ -2100,3 +2100,8 @@ plt.title('PNMF Training Loss') {% endhighlight %}  + +# Recommended Spark Configuration Settings + +For best performance, we recommend setting the following flags when running SystemML with Spark: +`--conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128`.
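As a brief, non-authoritative illustration of where these flags go in practice, the sketch below passes them when launching the Spark shell with SystemML on the classpath; the memory settings and the `SystemML.jar` path are assumptions for the example and should be adapted to your environment.

    # launch the Spark shell with the recommended SystemML flags (illustrative values)
    spark-shell --executor-memory 4G --driver-memory 4G \
        --conf spark.driver.maxResultSize=0 \
        --conf spark.akka.frameSize=128 \
        --jars SystemML.jar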
