Added: systemml/site/docs/1.1.0/algorithms-regression.html URL: http://svn.apache.org/viewvc/systemml/site/docs/1.1.0/algorithms-regression.html?rev=1828046&view=auto ============================================================================== --- systemml/site/docs/1.1.0/algorithms-regression.html (added) +++ systemml/site/docs/1.1.0/algorithms-regression.html Fri Mar 30 04:31:05 2018 @@ -0,0 +1,2759 @@ +<!DOCTYPE html> +<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--> +<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]--> +<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]--> +<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]--> + <head> + <title>SystemML Algorithms Reference - Regression - SystemML 1.1.0</title> + <meta charset="utf-8"> + <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> + + <meta name="viewport" content="width=device-width"> + <link rel="stylesheet" href="css/bootstrap.min.css"> + <link rel="stylesheet" href="css/main.css"> + <link rel="stylesheet" href="css/pygments-default.css"> + <link rel="shortcut icon" href="img/favicon.png"> + </head> + <body> + <!--[if lt IE 7]> + <p class="chromeframe">You are using an outdated browser. <a href="http://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p> + <![endif]--> + + <header class="navbar navbar-default navbar-fixed-top" id="topbar"> + <div class="container"> + <div class="navbar-header"> + <div class="navbar-brand brand projectlogo"> + <a href="http://systemml.apache.org/"><img class="logo" src="img/systemml-logo.png" alt="Apache SystemML" title="Apache SystemML"/></a> + </div> + <div class="navbar-brand brand projecttitle"> + <a href="http://systemml.apache.org/">Apache SystemML<sup id="trademark">â¢</sup></a><br/> + <span class="version">1.1.0</span> + </div> + <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target=".navbar-collapse"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + </div> + <nav class="navbar-collapse collapse"> + <ul class="nav navbar-nav navbar-right"> + <li><a href="index.html">Overview</a></li> + <li><a href="https://github.com/apache/systemml">GitHub</a></li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown">Documentation<b class="caret"></b></a> + <ul class="dropdown-menu" role="menu"> + <li><b>Running SystemML:</b></li> + <li><a href="https://github.com/apache/systemml">SystemML GitHub README</a></li> + <li><a href="spark-mlcontext-programming-guide.html">Spark MLContext</a></li> + <li><a href="spark-batch-mode.html">Spark Batch Mode</a> + <li><a href="hadoop-batch-mode.html">Hadoop Batch Mode</a> + <li><a href="standalone-guide.html">Standalone Guide</a></li> + <li><a href="jmlc.html">Java Machine Learning Connector (JMLC)</a> + <li class="divider"></li> + <li><b>Language Guides:</b></li> + <li><a href="dml-language-reference.html">DML Language Reference</a></li> + <li><a href="beginners-guide-to-dml-and-pydml.html">Beginner's Guide to DML and PyDML</a></li> + <li><a href="beginners-guide-python.html">Beginner's Guide for Python Users</a></li> + <li><a href="python-reference.html">Reference Guide for Python Users</a></li> + <li class="divider"></li> + <li><b>ML Algorithms:</b></li> + <li><a href="algorithms-reference.html">Algorithms 
Reference</a></li> + <li class="divider"></li> + <li><b>Tools:</b></li> + <li><a href="debugger-guide.html">Debugger Guide</a></li> + <li><a href="developer-tools-systemml.html">IDE Guide</a></li> + <li class="divider"></li> + <li><b>Other:</b></li> + <li><a href="contributing-to-systemml.html">Contributing to SystemML</a></li> + <li><a href="engine-dev-guide.html">Engine Developer Guide</a></li> + <li><a href="troubleshooting-guide.html">Troubleshooting Guide</a></li> + <li><a href="release-process.html">Release Process</a></li> + </ul> + </li> + + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a> + <ul class="dropdown-menu" role="menu"> + <li><a href="./api/java/index.html">Java</a></li> + <li><a href="./api/python/index.html">Python</a></li> + </ul> + </li> + + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown">Issues<b class="caret"></b></a> + <ul class="dropdown-menu" role="menu"> + <li><b>JIRA:</b></li> + <li><a href="https://issues.apache.org/jira/browse/SYSTEMML">SystemML JIRA</a></li> + + </ul> + </li> + </ul> + </nav> + </div> + </header> + + <div class="container" id="content"> + + <h1 class="title"><a href="algorithms-reference.html">SystemML Algorithms Reference</a></h1> + + + <!-- + +--> + +<h1 id="regression">4. Regression</h1> + +<h2 id="linear-regression">4.1. Linear Regression</h2> + +<h3 id="description">Description</h3> + +<p>Linear Regression scripts are used to model the relationship between one +numerical response variable and one or more explanatory (feature) +variables. The scripts are given a dataset $(X, Y) = (x_i, y_i)_{i=1}^n$ +where $x_i$ is a numerical vector of feature variables and $y_i$ is a +numerical response value for each training data record. The feature +vectors are provided as a matrix $X$ of size $n\,{\times}\,m$, where $n$ +is the number of records and $m$ is the number of features. The observed +response values are provided as a 1-column matrix $Y$, with a numerical +value $y_i$ for each $x_i$ in the corresponding row of matrix $X$.</p> + +<p>In linear regression, we predict the distribution of the response $y_i$ +based on a fixed linear combination of the features in $x_i$. We assume +that there exist constant regression coefficients +$\beta_0, \beta_1, \ldots, \beta_m$ and a constant residual +variance $\sigma^2$ such that</p> + +<script type="math/tex; mode=display">\begin{equation} +y_i \sim Normal(\mu_i, \sigma^2) \,\,\,\,\textrm{where}\,\,\,\, +\mu_i \,=\, \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} +\end{equation}</script> + +<p>Distribution +$y_i \sim Normal(\mu_i, \sigma^2)$ +models the “unexplained” residual noise and is assumed independent +across different records.</p> + +<p>The goal is to estimate the regression coefficients and the residual +variance. Once they are accurately estimated, we can make predictions +about $y_i$ given $x_i$ in new records. We can also use the $\beta_j$âs +to analyze the influence of individual features on the response value, +and assess the quality of this model by comparing residual variance in +the response, left after prediction, with its total variance.</p> + +<p>There are two scripts in our library, both doing the same estimation, +but using different computational methods. Depending on the size and the +sparsity of the feature matrix $X$, one or the other script may be more +efficient. 
The “direct solve” script <code>LinearRegDS.dml</code> is more +efficient when the number of features $m$ is relatively small +($m \sim 1000$ or less) and matrix $X$ is either tall or fairly dense +(has ${\gg}:m^2$ nonzeros); otherwise, the “conjugate gradient” script +<code>LinearRegCG.dml</code> is more efficient. If $m > 50000$, use only +<code>LinearRegCG.dml</code>.</p> + +<h3 id="usage">Usage</h3> + +<p><strong>Linear Regression - Direct Solve</strong>:</p> + +<div class="codetabs"> +<div data-lang="Python"> + + <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span> +<span class="c"># C = 1/reg (to disable regularization, use float("inf"))</span> +<span class="n">lr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">sqlCtx</span><span class="p">,</span> <span class="n">fit_intercept</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">normalize</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="nb">float</span><span class="p">(</span><span class="s">"inf"</span><span class="p">),</span> <span class="n">solver</span><span class="o">=</span><span class="s">'direct-solve'</span><span class="p">)</span> +<span class="c"># X_train, y_train and X_test can be NumPy matrices or Pandas DataFrame or SciPy Sparse Matrix</span> +<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span> +<span class="c"># df_train is DataFrame that contains two columns: "features" (of type Vector) and "label". 
df_test is a DataFrame that contains the column "features"</span> +<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df_train</span><span class="p">)</span></code></pre></div> + + </div> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f LinearRegDS.dml + -nvargs X=<file> + Y=<file> + B=<file> + O=[file] + icpt=[int] + reg=[double] + fmt=[format] +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f LinearRegDS.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=<file> + Y=<file> + B=<file> + O=[file] + icpt=[int] + reg=[double] + fmt=[format] +</code></pre> + </div> +</div> + +<p><strong>Linear Regression - Conjugate Gradient</strong>:</p> + +<div class="codetabs"> +<div data-lang="Python"> + + <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span> +<span class="c"># C = 1/reg (to disable regularization, use float("inf"))</span> +<span class="n">lr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">sqlCtx</span><span class="p">,</span> <span class="n">fit_intercept</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">normalize</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">tol</span><span class="o">=</span><span class="mf">0.000001</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="nb">float</span><span class="p">(</span><span class="s">"inf"</span><span class="p">),</span> <span class="n">solver</span><span class="o">=</span><span class="s">'newton-cg'</span><span class="p">)</span> +<span class="c"># X_train, y_train and X_test can be NumPy matrices or Pandas DataFrames or SciPy Sparse matrices</span> +<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span> +<span class="c"># df_train is DataFrame that contains two columns: "features" (of type Vector) and "label". 
df_test is a DataFrame that contains the column "features"</span> +<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df_train</span><span class="p">)</span></code></pre></div> + + </div> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f LinearRegCG.dml + -nvargs X=<file> + Y=<file> + B=<file> + O=[file] + Log=[file] + icpt=[int] + reg=[double] + tol=[double] + maxi=[int] + fmt=[format] +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f LinearRegCG.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=<file> + Y=<file> + B=<file> + O=[file] + Log=[file] + icpt=[int] + reg=[double] + tol=[double] + maxi=[int] + fmt=[format] +</code></pre> + </div> +</div> + +<h3 id="arguments-for-spark-and-hadoop-invocation">Arguments for Spark and Hadoop invocation</h3> + +<p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors, each row +constitutes one feature vector</p> + +<p><strong>Y</strong>: Location to read the 1-column matrix of response values</p> + +<p><strong>B</strong>: Location to store the estimated regression parameters (the $\beta_j$âs), +with the intercept parameter $\beta_0$ at position +B[$m\,{+}\,1$, 1] if available</p> + +<p><strong>O</strong>: (default: <code>" "</code>) Location to store the CSV-file of summary statistics defined +in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>, the default is to print it to the +standard output</p> + +<p><strong>Log</strong>: (default: <code>" "</code>, <code>LinearRegCG.dml</code> only) Location to store +iteration-specific variables for monitoring and debugging purposes, see +<a href="algorithms-regression.html#table8"><strong>Table 8</strong></a> +for details.</p> + +<p><strong>icpt</strong>: (default: <code>0</code>) Intercept presence and shifting/rescaling the features +in $X$:</p> + +<ul> + <li>0 = no intercept (hence no $\beta_0$), no shifting or +rescaling of the features</li> + <li>1 = add intercept, but do not shift/rescale the features +in $X$</li> + <li>2 = add intercept, shift/rescale the features in $X$ to +mean 0, variance 1</li> +</ul> + +<p><strong>reg</strong>: (default: <code>0.000001</code>) L2-regularization parameter $\lambda\geq 0$; set to nonzero for highly +dependent, sparse, or numerous ($m \gtrsim n/10$) features</p> + +<p><strong>tol</strong>: (default: <code>0.000001</code>, <code>LinearRegCG.dml</code> only) Tolerance $\varepsilon\geq 0$ used in the +convergence criterion: we terminate conjugate gradient iterations when +the $\beta$-residual reduces in L2-norm by this factor</p> + +<p><strong>maxi</strong>: (default: <code>0</code>, <code>LinearRegCG.dml</code> only) Maximum number of conjugate +gradient iterations, or <code>0</code> if no maximum limit provided</p> + +<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>, +<code>mm</code>, or <code>csv</code>; see read/write functions in +SystemML Language Reference for details.</p> + +<p>Please see <a href="https://apache.github.io/systemml/python-reference#mllearn-api">mllearn documentation</a> for +more details on the Python API.</p> + +<h3 id="examples">Examples</h3> + +<p><strong>Linear Regression - Direct Solve</strong>:</p> + +<div class="codetabs"> +<div data-lang="Python"> + 
+ <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> +<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span> +<span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span> +<span class="c"># Load the diabetes dataset</span> +<span class="n">diabetes</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_diabetes</span><span class="p">()</span> +<span class="c"># Use only one feature</span> +<span class="n">diabetes_X</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">data</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span> +<span class="c"># Split the data into training/testing sets</span> +<span class="n">diabetes_X_train</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> +<span class="n">diabetes_X_test</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span> +<span class="c"># Split the targets into training/testing sets</span> +<span class="n">diabetes_y_train</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> +<span class="n">diabetes_y_test</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span> +<span class="c"># Create linear regression object</span> +<span class="n">regr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="n">solver</span><span class="o">=</span><span class="s">'direct-solve'</span><span class="p">)</span> +<span class="c"># Train the model using the training sets</span> +<span class="n">regr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">diabetes_X_train</span><span class="p">,</span> <span class="n">diabetes_y_train</span><span class="p">)</span> +<span class="c"># The mean square error</span> +<span class="k">print</span><span class="p">(</span><span class="s">"Residual sum of squares: </span><span class="si">%.2</span><span class="s">f"</span> <span class="o">%</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">((</span><span class="n">regr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">diabetes_X_test</span><span class="p">)</span> <span class="o">-</span> <span class="n">diabetes_y_test</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span></code></pre></div> + + </div> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f LinearRegDS.dml + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + 
B=/user/ml/B.mtx + fmt=csv + O=/user/ml/stats.csv + icpt=2 + reg=1.0 +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f LinearRegDS.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + fmt=csv + O=/user/ml/stats.csv + icpt=2 + reg=1.0 +</code></pre> + </div> +</div> + +<p><strong>Linear Regression - Conjugate Gradient</strong>:</p> + +<div class="codetabs"> +<div data-lang="Python"> + + <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> +<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span> +<span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span> +<span class="c"># Load the diabetes dataset</span> +<span class="n">diabetes</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_diabetes</span><span class="p">()</span> +<span class="c"># Use only one feature</span> +<span class="n">diabetes_X</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">data</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span> +<span class="c"># Split the data into training/testing sets</span> +<span class="n">diabetes_X_train</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> +<span class="n">diabetes_X_test</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span> +<span class="c"># Split the targets into training/testing sets</span> +<span class="n">diabetes_y_train</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> +<span class="n">diabetes_y_test</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span> +<span class="c"># Create linear regression object</span> +<span class="n">regr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="n">solver</span><span class="o">=</span><span class="s">'newton-cg'</span><span class="p">)</span> +<span class="c"># Train the model using the training sets</span> +<span class="n">regr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">diabetes_X_train</span><span class="p">,</span> <span class="n">diabetes_y_train</span><span class="p">)</span> +<span class="c"># The mean square error</span> +<span class="k">print</span><span class="p">(</span><span class="s">"Residual sum of squares: </span><span class="si">%.2</span><span class="s">f"</span> <span class="o">%</span> <span 
class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">((</span><span class="n">regr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">diabetes_X_test</span><span class="p">)</span> <span class="o">-</span> <span class="n">diabetes_y_test</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span></code></pre></div> + + </div> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f LinearRegCG.dml + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + fmt=csv + O=/user/ml/stats.csv + icpt=2 + reg=1.0 + tol=0.00000001 + maxi=100 + Log=/user/ml/log.csv +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f LinearRegCG.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + fmt=csv + O=/user/ml/stats.csv + icpt=2 + reg=1.0 + tol=0.00000001 + maxi=100 + Log=/user/ml/log.csv +</code></pre> + </div> +</div> + +<hr /> + +<p><a name="table7"></a> +<strong>Table 7</strong>: Besides $\beta$, linear regression scripts compute a few summary statistics +listed below. The statistics are provided in CSV format, one comma-separated name-value +pair per each line.</p> + +<table> + <thead> + <tr> + <th>Name</th> + <th>Meaning</th> + </tr> + </thead> + <tbody> + <tr> + <td>AVG_TOT_Y</td> + <td>Average of the response value $Y$</td> + </tr> + <tr> + <td>STDEV_TOT_Y</td> + <td>Standard Deviation of the response value $Y$</td> + </tr> + <tr> + <td>AVG_RES_Y</td> + <td>Average of the residual $Y - \mathop{\mathrm{pred}}(Y \mid X)$, i.e. residual bias</td> + </tr> + <tr> + <td>STDEV_RES_Y</td> + <td>Standard Deviation of the residual $Y - \mathop{\mathrm{pred}}(Y \mid X)$</td> + </tr> + <tr> + <td>DISPERSION</td> + <td>GLM-style dispersion, i.e. residual sum of squares / #deg. fr.</td> + </tr> + <tr> + <td>R2</td> + <td>$R^2$ of residual with bias included vs. total average</td> + </tr> + <tr> + <td>ADJUSTED_R2</td> + <td>Adjusted $R^2$ of residual with bias included vs. total average</td> + </tr> + <tr> + <td>R2_NOBIAS</td> + <td>Plain $R^2$ of residual with bias subtracted vs. total average</td> + </tr> + <tr> + <td>ADJUSTED_R2_NOBIAS</td> + <td>Adjusted $R^2$ of residual with bias subtracted vs. total average</td> + </tr> + <tr> + <td>R2_VS_0</td> + <td>* $R^2$ of residual with bias included vs. zero constant</td> + </tr> + <tr> + <td>ADJUSTED_R2_VS_0</td> + <td>* Adjusted $R^2$ of residual with bias included vs. zero constant</td> + </tr> + </tbody> +</table> + +<p>* The last two statistics are only printed if there is no intercept (<code>icpt=0</code>)</p> + +<hr /> + +<p><a name="table8"></a> +<strong>Table 8</strong>: The <code>Log</code> file for the <code>LinearRegCG.dml</code> script +contains the above iteration variables in CSV format, each line +containing a triple (Name, Iteration#, Value) with Iteration# being 0 +for initial values.</p> + +<table> + <thead> + <tr> + <th>Name</th> + <th>Meaning</th> + </tr> + </thead> + <tbody> + <tr> + <td>CG_RESIDUAL_NORM</td> + <td>L2-norm of conjug. grad. residual, which is $A$ %*% $\beta - t(X)$ %*% $y$ where $A = t(X)$ %*% $X + diag(\lambda)$, or a similar quantity</td> + </tr> + <tr> + <td>CG_RESIDUAL_RATIO</td> + <td>Ratio of current L2-norm of conjug. grad. 
residual over the initial</td> + </tr> + </tbody> +</table> + +<hr /> + +<h3 id="details">Details</h3> + +<p>To solve a linear regression problem over feature matrix $X$ and +response vector $Y$, we can find coefficients +$\beta_0, \beta_1, \ldots, \beta_m$ and $\sigma^2$ that maximize the +joint likelihood of all $y_i$ for $i=1\ldots n$, defined by the assumed +statistical model (1). Since the joint likelihood of the +independent +$y_i \sim Normal(\mu_i, \sigma^2)$ +is proportional to the product of +$\exp\big({-}\,(y_i - \mu_i)^2 / (2\sigma^2)\big)$, we can take the +logarithm of this product, then multiply by $-2\sigma^2 < 0$ to obtain a +least squares problem:</p> + +<script type="math/tex; mode=display">\begin{equation} +\sum_{i=1}^n \, (y_i - \mu_i)^2 \,\,=\,\, +\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2 +\,\,\to\,\,\min +\end{equation}</script> + +<p>This may not be enough, however. The minimum may +sometimes be attained over infinitely many $\beta$-vectors, for example +if $X$ has an all-0 column, or has linearly dependent columns, or has +fewer rows than columns . Even if (2) has a unique +solution, other $\beta$-vectors may be just a little suboptimal<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, yet +give significantly different predictions for new feature vectors. This +results in <em>overfitting</em>: prediction error for the training data ($X$ +and $Y$) is much smaller than for the test data (new records).</p> + +<p>Overfitting and degeneracy in the data is commonly mitigated by adding a +regularization penalty term to the least squares function:</p> + +<script type="math/tex; mode=display">\begin{equation} +\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2 +\,+\,\, \lambda \sum_{j=1}^m \beta_j^2 +\,\,\to\,\,\min +\end{equation}</script> + +<p>The choice of $\lambda>0$, the regularization +constant, typically involves cross-validation where the dataset is +repeatedly split into a training part (to estimate the $\beta_j$âs) and +a test part (to evaluate prediction accuracy), with the goal of +maximizing the test accuracy. In our scripts, $\lambda$ is provided as +input parameter <code>reg</code>.</p> + +<p>The solution to the least squares problem (3), through +taking the derivative and setting it to 0, has the matrix linear +equation form</p> + +<script type="math/tex; mode=display">\begin{equation} +A\left[\textstyle\beta_{1:m}\atop\textstyle\beta_0\right] \,=\, \big[X,\,1\big]^T Y,\,\,\, +\textrm{where}\,\,\, +A \,=\, \big[X,\,1\big]^T \big[X,\,1\big]\,+\,\hspace{0.5pt} diag(\hspace{0.5pt} +\underbrace{\lambda,\ldots, \lambda}_{\scriptstyle m}, 0) +\end{equation}</script> + +<p>where $[X,\,1]$ is $X$ with an extra column of 1s +appended on the right, and the diagonal matrix of $\lambda$âs has a zero +to keep the intercept $\beta_0$ unregularized. If the intercept is +disabled by setting $icpt=0$, the equation is simply $X^T X \beta = X^T Y$.</p> + +<p>We implemented two scripts for solving equation (4): one +is a “direct solver” that computes $A$ and then solves +$A\beta = [X,\,1]^T Y$ by calling an external package, the other +performs linear conjugate gradient (CG) iterations without ever +materializing $A$. The CG algorithm closely follows Algorithm 5.2 in +Chapter 5 of <a href="algorithms-bibliography.html">[Nocedal2006]</a>. Each step in the CG algorithm +computes a matrix-vector multiplication $q = Ap$ by first computing +$[X,\,1]\, p$ and then $[X,\,1]^T [X,\,1]\, p$. 
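+</p>
+
+<p>As a small illustration of equation (4), the following NumPy sketch shows both the
+"direct solve" computation and the matrix-free product $q = Ap$ used by the conjugate
+gradient script. It is only a simplified sketch under the conventions above (intercept
+column appended last and left unregularized), not the DML code of the scripts; the helper
+names <code>direct_solve</code> and <code>cg_matvec</code> are hypothetical.</p>
+
+<pre><code>import numpy as np
+
+def direct_solve(X, y, lam):
+    # Equation (4): append a column of ones, form A = [X,1]^T [X,1] + diag(lam,...,lam,0)
+    # and solve A * beta = [X,1]^T y (sketch of the "direct solve" approach).
+    n, m = X.shape
+    X1 = np.hstack([X, np.ones((n, 1))])                     # [X, 1]
+    A = X1.T @ X1 + lam * np.diag(np.r_[np.ones(m), 0.0])    # intercept entry unregularized
+    b = X1.T @ y
+    return np.linalg.solve(A, b)                             # beta_1..beta_m, then beta_0
+
+def cg_matvec(X1, p, lam, m):
+    # Matrix-free product q = A p used by the CG script: compute [X,1]^T ([X,1] p)
+    # plus the regularization term, without ever materializing A.
+    q = X1.T @ (X1 @ p)
+    q[:m] += lam * p[:m]                                     # do not regularize the intercept
+    return q
+</code></pre>
+
+<p>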
Usually the number of +such multiplications, one per CG iteration, is much smaller than $m$. +The user can put a hard bound on it with input +parameter <code>maxi</code>, or use the default maximum of $m+1$ (or $m$ +if no intercept) by having <code>maxi=0</code>. The CG iterations +terminate when the L2-norm of vector $r = A\beta - [X,\,1]^T Y$ +decreases from its initial value (for $\beta=0$) by the tolerance factor +specified in input parameter <code>tol</code>.</p> + +<p>The CG algorithm is more efficient if computing +$[X,\,1]^T \big([X,\,1]\, p\big)$ is much faster than materializing $A$, +an $(m\,{+}\,1)\times(m\,{+}\,1)$ matrix. The Direct Solver (DS) is more +efficient if $X$ takes up a lot more memory than $A$ (i.e. $X$ has a lot +more nonzeros than $m^2$) and if $m^2$ is small enough for the external +solver ($m \lesssim 50000$). A more precise determination between CG +and DS is subject to further research.</p> + +<p>In addition to the $\beta$-vector, the scripts estimate the residual +standard deviation $\sigma$ and the $R^2$, the ratio of “explained” +variance to the total variance of the response variable. These +statistics only make sense if the number of degrees of freedom +$n\,{-}\,m\,{-}\,1$ is positive and the regularization constant +$\lambda$ is negligible or zero. The formulas for $\sigma$ and +$R^2$ are:</p> + +<script type="math/tex; mode=display">R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},\quad +\sigma \,=\, \sqrt{\frac{\mathrm{RSS}}{n - m - 1}},\quad +R^2_{\textrm{adj.}} = 1 - \frac{\sigma^2 (n-1)}{\mathrm{TSS}}</script> + +<p>where</p> + +<script type="math/tex; mode=display">\mathrm{RSS} \,=\, \sum_{i=1}^n \Big(y_i - \hat{\mu}_i - +\frac{1}{n} \sum_{i'=1}^n \,(y_{i'} - \hat{\mu}_{i'})\Big)^2; \quad +\mathrm{TSS} \,=\, \sum_{i=1}^n \Big(y_i - \frac{1}{n} \sum_{i'=1}^n y_{i'}\Big)^2</script> + +<p>Here $\hat{\mu}_i$ are the predicted means for $y_i$ based on the +estimated regression coefficients and the feature vectors. They may be +biased when no intercept is present, hence the RSS formula subtracts the +bias.</p> + +<p>Lastly, note that by choosing the input option <code>icpt=2</code> the +user can shift and rescale the columns of $X$ to have zero average and +the variance of 1. This is particularly important when using +regularization over highly disbalanced features, because regularization +tends to penalize small-variance columns (which need large $\beta_j$âs) +more than large-variance columns (with small $\beta_j$âs). At the end, +the estimated regression coefficients are shifted and rescaled to apply +to the original features.</p> + +<h3 id="returns">Returns</h3> + +<p>The estimated regression coefficients (the $\hat{\beta}_j$âs) are +populated into a matrix and written to an HDFS file whose path/name was +provided as the <code>B</code> input argument. 
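+</p>
+
+<p>For example, a coefficient matrix written with <code>fmt=csv</code> and
+<code>icpt=1</code> (so that, as described for the <code>B</code> argument above, the
+intercept $\hat{\beta}_0$ occupies the last row) could be used to score new records
+roughly as sketched below; the file name and variable names are hypothetical and the
+snippet is not part of the scripts.</p>
+
+<pre><code>import numpy as np
+
+# Hypothetical illustration: load B (fmt=csv, icpt=1) and score new feature vectors.
+B = np.loadtxt("B.csv", delimiter=",").reshape(-1, 1)
+beta, beta0 = B[:-1, 0], B[-1, 0]          # beta_1..beta_m, then the intercept beta_0
+X_new = np.random.rand(5, beta.shape[0])   # stand-in for new records
+y_pred = X_new @ beta + beta0              # predicted means for the new records
+</code></pre>
+
+<p>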
What this matrix +contains, and its size, depends on the input argument <code>icpt</code>, +which specifies the userâs intercept and rescaling choice:</p> + +<ul> + <li><strong>icpt=0</strong>: No intercept, matrix $B$ has size $m\,{\times}\,1$, with +$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to $m = {}$ncol$(X)$.</li> + <li><strong>icpt=1</strong>: There is intercept, but no shifting/rescaling of $X$; matrix $B$ has +size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from +1 to $m$, and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept +coefficient.</li> + <li><strong>icpt=2</strong>: There is intercept, and the features in $X$ are shifted to mean$ = 0$ +and rescaled to variance$ = 1$; then there are two versions of +the $\hat{\beta}_j$âs, one for the original features and another for the +shifted/rescaled features. Now matrix $B$ has size +$(m\,{+}\,1) \times 2$, with $B[\cdot, 1]$ for the original features and +$B[\cdot, 2]$ for the shifted/rescaled features, in the above format. +Note that $B[\cdot, 2]$ are iteratively estimated and $B[\cdot, 1]$ are +obtained from $B[\cdot, 2]$ by complementary shifting and rescaling.</li> +</ul> + +<p>The estimated summary statistics, including residual standard +deviation $\sigma$ and the $R^2$, are printed out or sent into a file +(if specified) in CSV format as defined in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>. +For conjugate gradient iterations, a log file with monitoring variables +can also be made available, see <a href="algorithms-regression.html#table8"><strong>Table 8</strong></a>.</p> + +<hr /> + +<h2 id="stepwise-linear-regression">4.2. Stepwise Linear Regression</h2> + +<h3 id="description-1">Description</h3> + +<p>Our stepwise linear regression script selects a linear model based on +the Akaike information criterion (AIC): the model that gives rise to the +lowest AIC is computed.</p> + +<h3 id="usage-1">Usage</h3> + +<div class="codetabs"> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f StepLinearRegDS.dml + -nvargs X=<file> + Y=<file> + B=<file> + S=[file] + O=[file] + icpt=[int] + thr=[double] + fmt=[format] +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f StepLinearRegDS.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=<file> + Y=<file> + B=<file> + S=[file] + O=[file] + icpt=[int] + thr=[double] + fmt=[format] +</code></pre> + </div> +</div> + +<h3 id="arguments-for-spark-and-hadoop-invocation-1">Arguments for Spark and Hadoop invocation</h3> + +<p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors, each row +contains one feature vector.</p> + +<p><strong>Y</strong>: Location (on HDFS) to read the 1-column matrix of response values</p> + +<p><strong>B</strong>: Location (on HDFS) to store the estimated regression parameters (the +$\beta_j$âs), with the intercept parameter $\beta_0$ at position +B[$m\,{+}\,1$, 1] if available</p> + +<p><strong>S</strong>: (default: <code>" "</code>) Location (on HDFS) to store the selected feature-ids in the +order as computed by the algorithm; by default the selected feature-ids +are forwarded to the standard output.</p> + +<p><strong>O</strong>: (default: <code>" "</code>) Location (on HDFS) to store the CSV-file of summary +statistics defined in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>; by default the 
+summary statistics are forwarded to the standard output.</p> + +<p><strong>icpt</strong>: (default: <code>0</code>) Intercept presence and shifting/rescaling the features +in $X$:</p> + +<ul> + <li>0 = no intercept (hence no $\beta_0$), no shifting or +rescaling of the features;</li> + <li>1 = add intercept, but do not shift/rescale the features +in $X$;</li> + <li>2 = add intercept, shift/rescale the features in $X$ to +mean 0, variance 1</li> +</ul> + +<p><strong>thr</strong>: (default: <code>0.01</code>) Threshold to stop the algorithm: if the decrease in the value +of the AIC falls below <code>thr</code> no further features are being +checked and the algorithm stops.</p> + +<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>, +<code>mm</code>, or <code>csv</code>; see read/write functions in +SystemML Language Reference for details.</p> + +<h3 id="examples-1">Examples</h3> + +<div class="codetabs"> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f StepLinearRegDS.dml + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + S=/user/ml/selected.csv + O=/user/ml/stats.csv + icpt=2 + thr=0.05 + fmt=csv +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f StepLinearRegDS.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + S=/user/ml/selected.csv + O=/user/ml/stats.csv + icpt=2 + thr=0.05 + fmt=csv +</code></pre> + </div> +</div> + +<h3 id="details-1">Details</h3> + +<p>Stepwise linear regression iteratively selects predictive variables in +an automated procedure. Currently, our implementation supports forward +selection: starting from an empty model (without any variable) the +algorithm examines the addition of each variable based on the AIC as a +model comparison criterion. The AIC is defined as</p> + +<script type="math/tex; mode=display">\begin{equation} +AIC = -2 \log{L} + 2 edf,\label{eq:AIC} +\end{equation}</script> + +<p>where $L$ denotes the +likelihood of the fitted model and $edf$ is the equivalent degrees of +freedom, i.e., the number of estimated parameters. This procedure is +repeated until including no additional variable improves the model by a +certain threshold specified in the input parameter <code>thr</code>.</p> + +<p>For fitting a model in each iteration we use the <code>direct solve</code> method +as in the script <code>LinearRegDS.dml</code> discussed in +<a href="algorithms-regression.html#linear-regression">Linear Regression</a>.</p> + +<h3 id="returns-1">Returns</h3> + +<p>Similar to the outputs from <code>LinearRegDS.dml</code> the stepwise +linear regression script computes the estimated regression coefficients +and stores them in matrix $B$ on HDFS. The format of matrix $B$ is +identical to the one produced by the scripts for linear regression (see +<a href="algorithms-regression.html#linear-regression">Linear Regression</a>). Additionally, <code>StepLinearRegDS.dml</code> +outputs the variable indices (stored in the 1-column matrix $S$) in the +order they have been selected by the algorithm, i.e., $i$th entry in +matrix $S$ corresponds to the variable which improves the AIC the most +in $i$th iteration. If the model with the lowest AIC includes no +variables matrix $S$ will be empty (contains one 0). 
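+</p>
+
+<p>A minimal sketch of the forward-selection procedure is given below for intuition,
+using the Gaussian log-likelihood so that, up to an additive constant,
+$AIC = n\log(\mathrm{RSS}/n) + 2\,edf$. It is purely illustrative and is not the
+<code>StepLinearRegDS.dml</code> implementation; the helper names are hypothetical and
+an intercept is always included.</p>
+
+<pre><code>import numpy as np
+
+def fit_rss(X, y, cols):
+    # Least-squares fit on the selected columns plus an intercept; returns the RSS.
+    Z = np.hstack([X[:, cols], np.ones((X.shape[0], 1))])
+    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
+    r = y - Z @ beta
+    return float(r @ r)
+
+def forward_select(X, y, thr=0.01):
+    # Forward selection by AIC (Gaussian case, constants dropped), stopping when the
+    # decrease in AIC falls below thr -- a sketch, not StepLinearRegDS.dml.
+    n, m = X.shape
+    selected, remaining = [], list(range(m))
+    best_aic = n * np.log(np.mean((y - y.mean()) ** 2)) + 2 * 1   # intercept-only model
+    while remaining:
+        aics = {j: n * np.log(fit_rss(X, y, selected + [j]) / n) + 2 * (len(selected) + 2)
+                for j in remaining}
+        j_best = min(aics, key=aics.get)
+        if best_aic - aics[j_best] < thr:      # improvement below threshold: stop
+            break
+        selected.append(j_best)
+        remaining.remove(j_best)
+        best_aic = aics[j_best]
+    return selected                            # feature indices in selection order
+</code></pre>
+
+<p>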
Moreover, the +estimated summary statistics as defined in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a> +are printed out or stored in a file (if requested). In the case where an +empty model achieves the best AIC these statistics will not be produced.</p> + +<hr /> + +<h2 id="generalized-linear-models">4.3. Generalized Linear Models</h2> + +<h3 id="description-2">Description</h3> + +<p>Generalized Linear Models +[<a href="algorithms-bibliography.html">Gill2000</a>, +<a href="algorithms-bibliography.html">McCullagh1989</a>, +<a href="algorithms-bibliography.html">Nelder1972</a>] +extend the methodology of linear +and logistic regression to a variety of distributions commonly assumed +as noise effects in the response variable. As before, we are given a +collection of records $(x_1, y_1)$, â¦, $(x_n, y_n)$ where $x_i$ is a +numerical vector of explanatory (feature) variables of size $\dim x_i = m$, and $y_i$ +is the response (dependent) variable observed for this vector. GLMs +assume that some linear combination of the features in $x_i$ determines +the <em>mean</em> $\mu_i$ of $y_i$, while the observed $y_i$ is a random +outcome of a noise distribution +$Prob[y\mid \mu_i]\,$<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> +with that mean $\mu_i$:</p> + +<script type="math/tex; mode=display">x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} +\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim Prob[y\mid \mu_i]</script> + +<p>In linear regression the response mean $\mu_i$ <em>equals</em> some linear +combination over $x_i$, denoted above by $\eta_i$. In logistic +regression with <script type="math/tex">y\in\{0, 1\}</script> (Bernoulli) the mean of $y$ is the same +as $Prob[y=1]$ +and equals $1/(1+e^{-\eta_i})$, the logistic function of $\eta_i$. In +GLM, $\mu_i$ and $\eta_i$ can be related via any given smooth monotone +function called the <em>link function</em>: $\eta_i = g(\mu_i)$. The unknown +linear combination parameters $\beta_j$ are assumed to be the same for +all records.</p> + +<p>The goal of the regression is to estimate the parameters $\beta_j$ from +the observed data. Once the $\beta_j$âs are accurately estimated, we can +make predictions about $y$ for a new feature vector $x$. To do so, +compute $\eta$ from $x$ and use the inverted link function +$\mu = g^{-1}(\eta)$ to compute the mean $\mu$ of $y$; then use the +distribution $Prob[y\mid \mu]$ +to make predictions about $y$. Both $g(\mu)$ and +$Prob[y\mid \mu]$ +are user-provided. 
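+</p>
+
+<p>For intuition, making predictions from a fitted GLM only requires the linear term and
+the inverse link $\mu = g^{-1}(\eta)$; a minimal sketch for a few common links follows.
+The function name and signature are hypothetical and the snippet is not part of
+<code>GLM.dml</code>.</p>
+
+<pre><code>import numpy as np
+
+def predict_mean(X_new, beta, beta0, link="log"):
+    # Illustrative only: compute eta_i = beta0 + x_i . beta and invert the link.
+    eta = X_new @ beta + beta0
+    if link == "log":          # eta = log(mu)        =>  mu = exp(eta)
+        return np.exp(eta)
+    if link == "logit":        # eta = log(mu/(1-mu)) =>  mu = 1/(1 + exp(-eta))
+        return 1.0 / (1.0 + np.exp(-eta))
+    if link == "identity":     # eta = mu
+        return eta
+    raise ValueError("link not covered by this sketch")
+</code></pre>
+
+<p>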
Our GLM script supports a standard set of +distributions and link functions, see below for details.</p> + +<h3 id="usage-2">Usage</h3> + +<div class="codetabs"> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f GLM.dml + -nvargs X=<file> + Y=<file> + B=<file> + fmt=[format] + O=[file] + Log=[file] + dfam=[int] + vpow=[double] + link=[int] + lpow=[double] + yneg=[double] + icpt=[int] + reg=[double] + tol=[double] + disp=[double] + moi=[int] + mii=[int] +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f GLM.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=<file> + Y=<file> + B=<file> + fmt=[format] + O=[file] + Log=[file] + dfam=[int] + vpow=[double] + link=[int] + lpow=[double] + yneg=[double] + icpt=[int] + reg=[double] + tol=[double] + disp=[double] + moi=[int] + mii=[int] +</code></pre> + </div> +</div> + +<h3 id="arguments-for-spark-and-hadoop-invocation-2">Arguments for Spark and Hadoop invocation</h3> + +<p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors; each row +constitutes an example.</p> + +<p><strong>Y</strong>: Location to read the response matrix, which may have 1 or 2 columns</p> + +<p><strong>B</strong>: Location to store the estimated regression parameters (the $\beta_j$âs), +with the intercept parameter $\beta_0$ at position +B[$m\,{+}\,1$, 1] if available</p> + +<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>, +<code>mm</code>, or <code>csv</code>; see read/write functions in +SystemML Language Reference for details.</p> + +<p><strong>O</strong>: (default: <code>" "</code>) Location to write certain summary statistics described +in <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a>, +by default it is standard output.</p> + +<p><strong>Log</strong>: (default: <code>" "</code>) Location to store iteration-specific variables for monitoring +and debugging purposes, see <a href="algorithms-regression.html#table10"><strong>Table 10</strong></a> for details.</p> + +<p><strong>dfam</strong>: (default: <code>1</code>) Distribution family code to specify +$Prob[y\mid \mu]$, +see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>:</p> + +<ul> + <li>1 = power distributions with $Var(y) = \mu^{\alpha}$</li> + <li>2 = binomial or Bernoulli</li> +</ul> + +<p><strong>vpow</strong>: (default: <code>0.0</code>) When <code>dfam=1</code>, this provides the $q$ in +$Var(y) = a\mu^q$, the power +dependence of the variance of $y$ on its mean. In particular, use:</p> + +<ul> + <li>0.0 = Gaussian</li> + <li>1.0 = Poisson</li> + <li>2.0 = Gamma</li> + <li>3.0 = inverse Gaussian</li> +</ul> + +<p><strong>link</strong>: (default: <code>0</code>) Link function code to determine the link +function $\eta = g(\mu)$:</p> + +<ul> + <li>0 = canonical link (depends on the distribution family), +see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a></li> + <li>1 = power functions</li> + <li>2 = logit</li> + <li>3 = probit</li> + <li>4 = cloglog</li> + <li>5 = cauchit</li> +</ul> + +<p><strong>lpow</strong>: (default: <code>1.0</code>) When link=1, this provides the $s$ in +$\eta = \mu^s$, the power link function; <code>lpow=0.0</code> gives the +log link $\eta = \log\mu$. 
Common power links:</p> + +<ul> + <li>-2.0 = $1/\mu^2$</li> + <li>-1.0 = reciprocal</li> + <li>0.0 = log</li> + <li>0.5 = sqrt</li> + <li>1.0 = identity</li> +</ul> + +<p><strong>yneg</strong>: (default: <code>0.0</code>) When <code>dfam=2</code> and the response matrix $Y$ has +1 column, this specifies the $y$-value used for Bernoulli “No” label +(“Yes” is $1$): +0.0 when $y\in\{0, 1\}$; -1.0 when +$y\in\{-1, 1\}$</p> + +<p><strong>icpt</strong>: (default: <code>0</code>) Intercept and shifting/rescaling of the features in $X$:</p> + +<ul> + <li>0 = no intercept (hence no $\beta_0$), no +shifting/rescaling of the features</li> + <li>1 = add intercept, but do not shift/rescale the features +in $X$</li> + <li>2 = add intercept, shift/rescale the features in $X$ to +mean 0, variance 1</li> +</ul> + +<p><strong>reg</strong>: (default: <code>0.0</code>) L2-regularization parameter ($\lambda$)</p> + +<p><strong>tol</strong>: (default: <code>0.000001</code>) Tolerance ($\varepsilon$) used in the convergence criterion: we +terminate the outer iterations when the deviance changes by less than +this factor; see below for details</p> + +<p><strong>disp</strong>: (default: <code>0.0</code>) Dispersion parameter, or 0.0 to estimate it from +data</p> + +<p><strong>moi</strong>: (default: <code>200</code>) Maximum number of outer (Fisher scoring) iterations</p> + +<p><strong>mii</strong>: (default: <code>0</code>) Maximum number of inner (conjugate gradient) iterations, or 0 +if no maximum limit provided</p> + +<h3 id="examples-2">Examples</h3> + +<div class="codetabs"> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f GLM.dml + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + fmt=csv + dfam=2 + link=2 + yneg=-1.0 + icpt=2 + reg=0.01 + tol=0.00000001 + disp=1.0 + moi=100 + mii=10 + O=/user/ml/stats.csv + Log=/user/ml/log.csv +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f GLM.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + fmt=csv + dfam=2 + link=2 + yneg=-1.0 + icpt=2 + reg=0.01 + tol=0.00000001 + disp=1.0 + moi=100 + mii=10 + O=/user/ml/stats.csv + Log=/user/ml/log.csv +</code></pre> + </div> +</div> + +<hr /> + +<p><a name="table9"></a> +<strong>Table 9</strong>: Besides $\beta$, GLM regression script computes a few summary +statistics listed below. 
They are provided in CSV format, one +comma-separated name-value pair per each line.</p> + +<table> + <thead> + <tr> + <th>Name</th> + <th>Meaning</th> + </tr> + </thead> + <tbody> + <tr> + <td>TERMINATION_CODE</td> + <td>A positive integer indicating success/failure as follows: <br />1 = Converged successfully; 2 = Maximum # of iterations reached; 3 = Input (X, Y) out of range; 4 = Distribution/link not supported</td> + </tr> + <tr> + <td>BETA_MIN</td> + <td>Smallest beta value (regression coefficient), excluding the intercept</td> + </tr> + <tr> + <td>BETA_MIN_INDEX</td> + <td>Column index for the smallest beta value</td> + </tr> + <tr> + <td>BETA_MAX</td> + <td>Largest beta value (regression coefficient), excluding the intercept</td> + </tr> + <tr> + <td>BETA_MAX_INDEX</td> + <td>Column index for the largest beta value</td> + </tr> + <tr> + <td>INTERCEPT</td> + <td>Intercept value, or NaN if there is no intercept (if <code>icpt=0</code>)</td> + </tr> + <tr> + <td>DISPERSION</td> + <td>Dispersion used to scale deviance, provided in disp input argument or estimated (same as DISPERSION_EST) if disp argument is $\leq 0$</td> + </tr> + <tr> + <td>DISPERSION_EST</td> + <td>Dispersion estimated from the dataset</td> + </tr> + <tr> + <td>DEVIANCE_UNSCALED</td> + <td>Deviance from the saturated model, assuming dispersion $= 1.0$</td> + </tr> + <tr> + <td>DEVIANCE_SCALED</td> + <td>Deviance from the saturated model, scaled by DISPERSION value</td> + </tr> + </tbody> +</table> + +<hr /> + +<p><a name="table10"></a> +<strong>Table 10</strong>: The Log file for GLM regression contains the below +iteration variables in CSV format, each line containing a triple (Name, +Iteration#, Value) with Iteration# being 0 for initial values.</p> + +<table> + <thead> + <tr> + <th>Name</th> + <th>Meaning</th> + </tr> + </thead> + <tbody> + <tr> + <td>NUM_CG_ITERS</td> + <td>Number of inner (Conj. Gradient) iterations in this outer iteration</td> + </tr> + <tr> + <td>IS_TRUST_REACHED</td> + <td>1 = trust region boundary was reached, 0 = otherwise</td> + </tr> + <tr> + <td>POINT_STEP_NORM</td> + <td>L2-norm of iteration step from old point ($\beta$-vector) to new point</td> + </tr> + <tr> + <td>OBJECTIVE</td> + <td>The loss function we minimize (negative partial log-likelihood)</td> + </tr> + <tr> + <td>OBJ_DROP_REAL</td> + <td>Reduction in the objective during this iteration, actual value</td> + </tr> + <tr> + <td>OBJ_DROP_PRED</td> + <td>Reduction in the objective predicted by a quadratic approximation</td> + </tr> + <tr> + <td>OBJ_DROP_RATIO</td> + <td>Actual-to-predicted reduction ratio, used to update the trust region</td> + </tr> + <tr> + <td>GRADIENT_NORM</td> + <td>L2-norm of the loss function gradient (omitted if point is rejected)</td> + </tr> + <tr> + <td>LINEAR_TERM_MIN</td> + <td>The minimum value of $X$ %*% $\beta$, used to check for overflows</td> + </tr> + <tr> + <td>LINEAR_TERM_MAX</td> + <td>The maximum value of $X$ %*% $\beta$, used to check for overflows</td> + </tr> + <tr> + <td>IS_POINT_UPDATED</td> + <td>1 = new point accepted; 0 = new point rejected, old point restored</td> + </tr> + <tr> + <td>TRUST_DELTA</td> + <td>Updated trust region size, the “delta”</td> + </tr> + </tbody> +</table> + +<hr /> + +<p><a name="table11"></a> +<strong>Table 11</strong>: Common GLM distribution families and link functions. 
(Here “*” stands +for “any value.”)</p> + +<table> + <thead> + <tr> + <th style="text-align: center">dfam</th> + <th style="text-align: center">vpow</th> + <th style="text-align: center">link</th> + <th style="text-align: center">lpow</th> + <th style="text-align: center">Distribution<br />Family</th> + <th style="text-align: center">Link<br /> Function</th> + <th style="text-align: center">Canonical</th> + </tr> + </thead> + <tbody> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">0.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center">-1.0</td> + <td style="text-align: center">Gaussian</td> + <td style="text-align: center">inverse</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">0.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.0</td> + <td style="text-align: center">Gaussian</td> + <td style="text-align: center">log</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">0.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 1.0</td> + <td style="text-align: center">Gaussian</td> + <td style="text-align: center">identity</td> + <td style="text-align: center">Yes</td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">1.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.0</td> + <td style="text-align: center">Poisson</td> + <td style="text-align: center">log</td> + <td style="text-align: center">Yes</td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">1.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.5</td> + <td style="text-align: center">Poisson</td> + <td style="text-align: center">sq.root</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">1.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 1.0</td> + <td style="text-align: center">Poisson</td> + <td style="text-align: center">identity</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">2.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center">-1.0</td> + <td style="text-align: center">Gamma</td> + <td style="text-align: center">inverse</td> + <td style="text-align: center">Yes</td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">2.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.0</td> + <td style="text-align: center">Gamma</td> + <td style="text-align: center">log</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">2.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 1.0</td> + <td style="text-align: center">Gamma</td> + <td style="text-align: center">identity</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">3.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center">-2.0</td> + <td style="text-align: center">Inverse Gauss</td> + <td style="text-align: center">$1/\mu^2$</td> + <td style="text-align: center">Yes</td> 
+ </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">3.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center">-1.0</td> + <td style="text-align: center">Inverse Gauss</td> + <td style="text-align: center">inverse</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">3.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.0</td> + <td style="text-align: center">Inverse Gauss</td> + <td style="text-align: center">log</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">3.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 1.0</td> + <td style="text-align: center">Inverse Gauss</td> + <td style="text-align: center">identity</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">2</td> + <td style="text-align: center">*</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.0</td> + <td style="text-align: center">Binomial</td> + <td style="text-align: center">log</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">2</td> + <td style="text-align: center">*</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.5</td> + <td style="text-align: center">Binomial</td> + <td style="text-align: center">sq.root</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">2</td> + <td style="text-align: center">*</td> + <td style="text-align: center">2</td> + <td style="text-align: center"> *</td> + <td style="text-align: center">Binomial</td> + <td style="text-align: center">logit</td> + <td style="text-align: center">Yes</td> + </tr> + <tr> + <td style="text-align: center">2</td> + <td style="text-align: center">*</td> + <td style="text-align: center">3</td> + <td style="text-align: center"> *</td> + <td style="text-align: center">Binomial</td> + <td style="text-align: center">probit</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">2</td> + <td style="text-align: center">*</td> + <td style="text-align: center">4</td> + <td style="text-align: center"> *</td> + <td style="text-align: center">Binomial</td> + <td style="text-align: center">cloglog</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">2</td> + <td style="text-align: center">*</td> + <td style="text-align: center">5</td> + <td style="text-align: center"> *</td> + <td style="text-align: center">Binomial</td> + <td style="text-align: center">cauchit</td> + <td style="text-align: center"> </td> + </tr> + </tbody> +</table> + +<hr /> + +<p><a name="table12"></a> +<strong>Table 12</strong>: The supported non-power link functions for the Bernoulli and the +binomial distributions. 
+Here $\mu$ is the Bernoulli mean.</p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Name</th>
+      <th>Link Function</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Logit</td>
+      <td>$\displaystyle \mu = 1 / \big(1 + e^{-\eta}\big)^{\mathstrut}$</td>
+    </tr>
+    <tr>
+      <td>Probit</td>
+      <td><script type="math/tex">\displaystyle \mu = \frac{1}{\sqrt{2\pi}}\int\nolimits_{-\infty_{\mathstrut}}^{\,\eta\mathstrut} e^{-\frac{t^2}{2}} dt</script></td>
+    </tr>
+    <tr>
+      <td>Cloglog</td>
+      <td>$\displaystyle \eta = \log \big(- \log(1 - \mu)\big)^{\mathstrut}$</td>
+    </tr>
+    <tr>
+      <td>Cauchit</td>
+      <td>$\displaystyle \eta = \tan\big(\pi(\mu - 1/2)\big)$</td>
+    </tr>
+  </tbody>
+</table>
+
+<hr />
+
+<h3 id="details-2">Details</h3>
+
+<p>In GLM, the noise distribution
+$Prob[y\mid \mu]$
+of the response variable $y$ given its mean $\mu$ is restricted to have
+the <em>exponential family</em> form</p>
+
+<script type="math/tex; mode=display">\begin{equation}
+Y \sim\, Prob[y\mid \mu] \,=\, \exp\left(\frac{y\theta - b(\theta)}{a}
++ c(y, a)\right),\,\,\textrm{where}\,\,\,\mu = E(Y) = b'(\theta).
+\end{equation}</script>
+
+<p>Changing the mean in such a distribution simply
+multiplies all
+$Prob[y\mid \mu]$
+by $e^{\,y\hspace{0.2pt}\theta/a}$ and rescales them so
+that they again integrate to 1. Parameter $\theta$ is called
+<em>canonical</em>, and the function $\theta = b'^{\,-1}(\mu)$ that relates it
+to the mean is called the <em>canonical link</em>; constant $a$ is called
+<em>dispersion</em> and rescales the variance of $y$. Many common distributions
+can be put into this form; see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>. The canonical
+parameter $\theta$ is often chosen to coincide with $\eta$, the linear
+combination of the regression features; other choices for $\eta$ are
+possible too.</p>
+
+<p>Rather than specifying the canonical link, GLM distributions are
+commonly defined by their variance
+$Var(y)$ as a function of
+the mean $\mu$. It can be shown from Eq. 6 that
+$Var(y) = a\,b''(\theta) = a\,b''(b'^{\,-1}(\mu))$.
+For example, for the Bernoulli distribution
+$Var(y) = \mu(1-\mu)$, for the
+Poisson distribution
+$Var(y) = \mu$, and for the Gaussian distribution
+$Var(y) = a\cdot 1 = \sigma^2$.
+It turns out that for many common distributions
+$Var(y) = a\mu^q$, a power
+function. We support all distributions where
+$Var(y) = a\mu^q$, as well as
+the Bernoulli and the binomial distributions.</p>
+
+<p>For distributions with
+$Var(y) = a\mu^q$ the
+canonical link is also a power function, namely
+$\theta = \mu^{1-q}/(1-q)$, except for the Poisson ($q = 1$) whose
+canonical link is $\theta = \log\mu$. We support all power link
+functions in the form $\eta = \mu^s$, dropping any constant factor, with
+$\eta = \log\mu$ for $s=0$. The binomial distribution has its own family
+of link functions, which includes logit (the canonical link), probit,
+cloglog, and cauchit (see <a href="algorithms-regression.html#table12"><strong>Table 12</strong></a>); we support
+these only for the binomial and Bernoulli distributions. Links and
+distributions are specified via four input parameters:
+<code>dfam</code>, <code>vpow</code>, <code>link</code>, and
+<code>lpow</code> (see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>).</p>
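+<p>To make <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a> concrete, here is a small illustrative Python lookup
+(hypothetical helper code, not part of <code>GLM.dml</code>) that maps a few common models to the
+<code>dfam</code>, <code>vpow</code>, <code>link</code>, and <code>lpow</code> codes from the table; for the
+binomial family the table marks <code>vpow</code> and <code>lpow</code> with “*”, i.e. any value may be passed:</p>
+
+<pre><code># Illustrative lookup of (dfam, vpow, link, lpow) codes taken from Table 11.
+# The model names on the left are informal labels, not GLM.dml arguments.
+GLM_CONFIGS = {
+    "linear_regression":   dict(dfam=1, vpow=0.0, link=1, lpow=1.0),   # Gaussian, identity
+    "poisson_regression":  dict(dfam=1, vpow=1.0, link=1, lpow=0.0),   # Poisson, log
+    "gamma_regression":    dict(dfam=1, vpow=2.0, link=1, lpow=0.0),   # Gamma, log
+    "inverse_gaussian":    dict(dfam=1, vpow=3.0, link=1, lpow=-2.0),  # Inverse Gauss, canonical
+    "logistic_regression": dict(dfam=2, vpow=0.0, link=2, lpow=0.0),   # Binomial, logit (vpow/lpow ignored)
+    "probit_regression":   dict(dfam=2, vpow=0.0, link=3, lpow=0.0),   # Binomial, probit (vpow/lpow ignored)
+}
+
+def glm_nvargs(model):
+    """Format the Table 11 codes as key=value pairs for an -nvargs command line."""
+    return " ".join("{0}={1}".format(k, v) for k, v in GLM_CONFIGS[model].items())
+
+print(glm_nvargs("poisson_regression"))   # dfam=1 vpow=1.0 link=1 lpow=0.0
+</code></pre>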
+<p>The observed response values are provided to the regression script as a
+matrix $Y$ having 1 or 2 columns. If a power distribution family is
+selected (<code>dfam=1</code>), matrix $Y$ must have 1 column that
+provides $y_i$ for each $x_i$ in the corresponding row of matrix $X$.
+When <code>dfam=2</code> and $Y$ has 1 column, we assume the Bernoulli
+distribution for <script type="math/tex">y_i\in\{y_{\mathrm{neg}}, 1\}</script> with $y_{\mathrm{neg}}$
+from the input parameter <code>yneg</code>. When <code>dfam=2</code> and
+$Y$ has 2 columns, we assume the binomial distribution; for each row $i$
+in $X$, cells $Y[i, 1]$ and $Y[i, 2]$ provide the positive and the
+negative binomial counts, respectively. Internally we convert the
+1-column Bernoulli into the 2-column binomial with 0-versus-1 counts.</p>
+
+<p>We estimate the regression parameters via L2-regularized negative
+log-likelihood minimization:</p>
+
+<script type="math/tex; mode=display">f(\beta; X, Y) \,\,=\,\, -\sum\nolimits_{i=1}^n \big(y_i\theta_i - b(\theta_i)\big)
+\,+\,(\lambda/2) \sum\nolimits_{j=1}^m \beta_j^2\,\,\to\,\,\min</script>
+
+<p>where
+$\theta_i$ and $b(\theta_i)$ are from (6); note that $a$ and
+$c(y, a)$ are constant w.r.t. $\beta$ and can be ignored here. The
+canonical parameter $\theta_i$ depends on both $\beta$ and $x_i$:</p>
+
+<script type="math/tex; mode=display">\theta_i \,\,=\,\, b'^{\,-1}(\mu_i) \,\,=\,\, b'^{\,-1}\big(g^{-1}(\eta_i)\big) \,\,=\,\,
+\big(b'^{\,-1}\circ g^{-1}\big)\left(\beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}\right)</script>
+
+<p>The user-provided (via <code>reg</code>) regularization coefficient
+$\lambda\geq 0$ can be used to mitigate overfitting and degeneracy in
+the data. Note that the intercept is never regularized.</p>
+
+<p>Our iterative minimizer for $f(\beta; X, Y)$ uses the Fisher scoring
+approximation to the difference
+$\varDelta f(z; \beta) = f(\beta + z; X, Y) \,-\, f(\beta; X, Y)$,
+recomputed at each iteration:</p>
+
+<script type="math/tex; mode=display">\begin{gathered}
+\varDelta f(z; \beta) \,\,\,\approx\,\,\, 1/2 \cdot z^T A z \,+\, G^T z,
+\,\,\,\,\textrm{where}\,\,\,\, A \,=\, X^T\!diag(w) X \,+\, \lambda I\\
+\textrm{and}\,\,\,\,G \,=\, - X^T u \,+\, \lambda\beta,
+\,\,\,\textrm{with $n\,{\times}\,1$ vectors $w$ and $u$ given by}\\
+\forall\,i = 1\ldots n: \,\,\,\,
+w_i = \big[v(\mu_i)\,g'(\mu_i)^2\big]^{-1}
+\!\!\!\!\!\!,\,\,\,\,\,\,\,\,\,
+u_i = (y_i - \mu_i)\big[v(\mu_i)\,g'(\mu_i)\big]^{-1}
+\!\!\!\!\!\!.\,\,\,\,\end{gathered}</script>
+
+<p>Here
+$v(\mu_i)=Var(y_i)/a$, the
+variance of $y_i$ as a function of the mean, and
+$g'(\mu_i) = d \eta_i/d \mu_i$ is the link function derivative. The
+Fisher scoring approximation is minimized by trust-region conjugate
+gradient iterations (called the <em>inner</em> iterations, with the Fisher
+scoring iterations as the <em>outer</em> iterations), which approximately solve
+the following problem:</p>
+
+<script type="math/tex; mode=display">1/2 \cdot z^T A z \,+\, G^T z \,\,\to\,\,\min\,\,\,\,\textrm{subject to}\,\,\,\,
+\|z\|_2 \leq \delta</script>
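+<p>The following is a minimal numpy sketch, not the <code>GLM.dml</code> implementation: it specializes the
+$w$, $u$, $A$, and $G$ formulas above to the Poisson family with the log link
+(<code>dfam=1, vpow=1.0, link=1, lpow=0.0</code>), drops the intercept for brevity, and replaces the
+trust-region conjugate gradient inner loop by a direct solve of the quadratic model:</p>
+
+<pre><code># Sketch of one Fisher scoring (outer) step for Poisson regression with log link.
+import numpy as np
+
+def fisher_scoring_step(X, y, beta, lam):
+    eta = X @ beta                 # linear predictor eta_i
+    mu = np.exp(eta)               # inverse log link: mu_i = g^{-1}(eta_i)
+    v = mu                         # Poisson variance function v(mu) = mu
+    g_prime = 1.0 / mu             # g'(mu) = d eta / d mu for the log link
+    w = 1.0 / (v * g_prime ** 2)   # w_i = [v(mu_i) g'(mu_i)^2]^{-1}
+    u = (y - mu) / (v * g_prime)   # u_i = (y_i - mu_i) [v(mu_i) g'(mu_i)]^{-1}
+    A = X.T @ (w[:, None] * X) + lam * np.eye(X.shape[1])
+    G = -X.T @ u + lam * beta
+    z = np.linalg.solve(A, -G)     # GLM.dml instead minimizes 1/2 z'Az + G'z
+    return beta + z                # subject to a trust-region bound on z
+
+rng = np.random.default_rng(0)
+X = rng.normal(size=(100, 3))
+y = rng.poisson(np.exp(X @ np.array([0.4, -0.2, 0.1])))
+beta = np.zeros(3)
+for _ in range(10):
+    beta = fisher_scoring_step(X, y, beta, lam=0.0)
+print(beta)   # should be close to the generating coefficients (0.4, -0.2, 0.1)
+</code></pre>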
+<p>The conjugate gradient algorithm closely follows
+Algorithm 7.2 on page 171 of <a href="algorithms-bibliography.html">[Nocedal2006]</a>. The trust region
+size $\delta$ is initialized as
+$0.5\sqrt{m}\,/ \max\nolimits_i |x_i|_2$ and updated as described
+in <a href="algorithms-bibliography.html">[Nocedal2006]</a>.
+The user can specify the maximum number of
+the outer and the inner iterations with input parameters
+<code>moi</code> and <code>mii</code>, respectively. The Fisher scoring
+algorithm terminates successfully if
+$2|\varDelta f(z; \beta)| < (D_1(\beta) + 0.1)\hspace{0.5pt}{\varepsilon}$,
+where ${\varepsilon}> 0$ is a tolerance supplied by the user via
+<code>tol</code>, and $D_1(\beta)$ is the unit-dispersion deviance
+estimated as</p>
+
+<script type="math/tex; mode=display">D_1(\beta) \,\,=\,\, 2 \cdot \big(\log Prob[Y \mid \!
+\begin{smallmatrix}\textrm{saturated}\\\textrm{model}\end{smallmatrix}, a\,{=}\,1]
+\,\,-\,\,\log Prob[Y \mid X, \beta, a\,{=}\,1]\,\big)</script>
+
+<p>The deviance estimate is also produced as part of the output. Once the
+Fisher scoring algorithm terminates, if requested by the user, we
+estimate the dispersion $a$ from (6) using Pearson residuals</p>
+
+<script type="math/tex; mode=display">\begin{equation}
+\hat{a} \,\,=\,\, \frac{1}{n-m}\cdot \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{v(\mu_i)}
+\end{equation}</script>
+
+<p>and use it to adjust our deviance estimate:
+$D_{\hat{a}}(\beta) = D_1(\beta)/\hat{a}$. If input argument
+<code>disp</code> is <code>0.0</code> we estimate $\hat{a}$, otherwise
+we use its value as $a$. Note that in (7) $m$ counts
+the intercept ($m \leftarrow m+1$) if it is present.</p>
+
+<h3 id="returns-2">Returns</h3>
+
+<p>The estimated regression parameters (the $\hat{\beta}_j$'s) are
+populated into a matrix and written to an HDFS file whose path/name was
+provided as the <code>B</code> input argument. What this matrix
+contains, and its size, depends on the input argument <code>icpt</code>,
+which specifies the user's intercept and rescaling choice:</p>
+
+<ul>
+  <li><strong>icpt=0</strong>: No intercept; matrix $B$ has size $m\,{\times}\,1$, with
+$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to $m = {}$ncol$(X)$.</li>
+  <li><strong>icpt=1</strong>: There is an intercept, but no shifting/rescaling of $X$; matrix $B$ has
+size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from
+1 to $m$, and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept
+coefficient.</li>
+  <li><strong>icpt=2</strong>: There is an intercept, and the features in $X$ are shifted to mean${} = 0$
+and rescaled to variance${} = 1$; then there are two versions of
+the $\hat{\beta}_j$'s, one for the original features and another for the
+shifted/rescaled features. Now matrix $B$ has size
+$(m\,{+}\,1) \times 2$, with $B[\cdot, 1]$ for the original features and
+$B[\cdot, 2]$ for the shifted/rescaled features, in the above format.
+Note that $B[\cdot, 2]$ are iteratively estimated and $B[\cdot, 1]$ are
+obtained from $B[\cdot, 2]$ by complementary shifting and rescaling.</li>
+</ul>
+
+<p>Our script also estimates the dispersion $\hat{a}$ (or takes it from the
+user's input) and the deviances $D_1(\hat{\beta})$ and
+$D_{\hat{a}}(\hat{\beta})$; see <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a> for details. A
+log file with variables monitoring progress through the iterations can
+also be made available; see <a href="algorithms-regression.html#table10"><strong>Table 10</strong></a>.</p>
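+<p>As a minimal illustration of the <code>B</code> layout above (not a replacement for
+<code>GLM-predict.dml</code>, covered in Section 4.5 below), the following numpy sketch applies a matrix $B$
+written with <code>icpt=1</code> and <code>fmt=csv</code> to new data under a Bernoulli/logit model; the
+file names are placeholders:</p>
+
+<pre><code># Hypothetical scoring sketch: B[1..m] are feature coefficients, B[m+1] is the intercept.
+import numpy as np
+
+B = np.loadtxt("B.csv", delimiter=",").reshape(-1)   # (m+1) values, icpt=1 layout
+X_new = np.loadtxt("X_new.csv", delimiter=",")       # n x m matrix of new records
+
+beta, beta0 = B[:-1], B[-1]
+eta = X_new @ beta + beta0           # linear predictor eta_i
+mu = 1.0 / (1.0 + np.exp(-eta))      # inverse logit link: estimated Prob[y_i = 1]
+print(mu[:5])
+</code></pre>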
+<h3 id="see-also">See Also</h3>
+
+<p>In case of binary classification problems, consider using L2-SVM or
+binary logistic regression; for multiclass classification, use
+multiclass SVM or multinomial logistic regression. For the special cases
+of linear regression and logistic regression, it may be more efficient
+to use the corresponding specialized scripts instead of GLM.</p>
+
+<hr />
+
+<h2 id="stepwise-generalized-linear-regression">4.4. Stepwise Generalized Linear Regression</h2>
+
+<h3 id="description-3">Description</h3>
+
+<p>Our stepwise generalized linear regression script selects a model based
+on the Akaike information criterion (AIC): the model that attains the
+lowest AIC is returned. Note that currently only the Bernoulli
+distribution family is supported (see below for details).</p>
+
+<h3 id="usage-3">Usage</h3>
+
+<div class="codetabs">
+<div data-lang="Hadoop">
+  <pre><code>hadoop jar SystemML.jar -f StepGLM.dml
+                        -nvargs X=<file>
+                                Y=<file>
+                                B=<file>
+                                S=[file]
+                                O=[file]
+                                link=[int]
+                                yneg=[double]
+                                icpt=[int]
+                                tol=[double]
+                                disp=[double]
+                                moi=[int]
+                                mii=[int]
+                                thr=[double]
+                                fmt=[format]
+</code></pre>
+  </div>
+<div data-lang="Spark">
+  <pre><code>$SPARK_HOME/bin/spark-submit --master yarn
+                             --deploy-mode cluster
+                             --conf spark.driver.maxResultSize=0
+                             SystemML.jar
+                             -f StepGLM.dml
+                             -config SystemML-config.xml
+                             -exec hybrid_spark
+                             -nvargs X=<file>
+                                     Y=<file>
+                                     B=<file>
+                                     S=[file]
+                                     O=[file]
+                                     link=[int]
+                                     yneg=[double]
+                                     icpt=[int]
+                                     tol=[double]
+                                     disp=[double]
+                                     moi=[int]
+                                     mii=[int]
+                                     thr=[double]
+                                     fmt=[format]
+</code></pre>
+  </div>
+</div>
+
+<h3 id="arguments-for-spark-and-hadoop-invocation-3">Arguments for Spark and Hadoop invocation</h3>
+
+<p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors; each row is an
+example.</p>
+
+<p><strong>Y</strong>: Location (on HDFS) to read the response matrix, which may have 1 or 2
+columns.</p>
+
+<p><strong>B</strong>: Location (on HDFS) to store the estimated regression parameters (the
+$\beta_j$'s), with the intercept parameter $\beta_0$ at position
+B[$m\,{+}\,1$, 1] if available</p>
+
+<p><strong>S</strong>: (default: <code>" "</code>) Location (on HDFS) to store the selected feature-ids in the
+order as computed by the algorithm; by default it is standard output.</p>
+
+<p><strong>O</strong>: (default: <code>" "</code>) Location (on HDFS) to write certain summary statistics
+described in <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a>; by default it is standard
+output.</p>
+
+<p><strong>link</strong>: (default: <code>2</code>) Link function code to determine the link
+function $\eta = g(\mu)$, see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>; currently the
+following link functions are supported:</p>
+
+<ul>
+  <li>1 = log</li>
+  <li>2 = logit</li>
+  <li>3 = probit</li>
+  <li>4 = cloglog</li>
+</ul>
+
+<p><strong>yneg</strong>: (default: <code>0.0</code>) Response value for Bernoulli “No” label, usually 0.0 or -1.0</p>
+
+<p><strong>icpt</strong>: (default: <code>0</code>) Intercept and shifting/rescaling of the features in $X$:</p>
+
+<ul>
+  <li>0 = no intercept (hence no $\beta_0$), no
+shifting/rescaling of the features</li>
+  <li>1 = add intercept, but do not shift/rescale the features
+in $X$</li>
+  <li>2 = add intercept, shift/rescale the features in $X$ to
+mean 0, variance 1</li>
+</ul>
+
+<p><strong>tol</strong>: (default: <code>0.000001</code>) Tolerance ($\epsilon$) used in the convergence criterion: we
+terminate the outer iterations when the deviance changes by less than
+this factor; see below for details.</p>
+
+<p><strong>disp</strong>: (default: <code>0.0</code>) Dispersion parameter, or <code>0.0</code> to estimate it from
+data</p>
+
+<p><strong>moi</strong>: (default: <code>200</code>) Maximum number of outer (Fisher scoring) iterations</p>
+
+<p><strong>mii</strong>: (default: <code>0</code>) Maximum number of inner (conjugate gradient)
+iterations, or 0 if no maximum limit is provided</p>
+
+<p><strong>thr</strong>: (default: <code>0.01</code>) Threshold to stop the algorithm: if the decrease in the value
+of the AIC falls below <code>thr</code>, no further features are
+checked and the algorithm stops.</p>
+
+<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>,
+<code>mm</code>, or <code>csv</code>; see read/write functions in
+SystemML Language Reference for details.</p>
+
+<h3 id="examples-3">Examples</h3>
+
+<div class="codetabs">
+<div data-lang="Hadoop">
+  <pre><code>hadoop jar SystemML.jar -f StepGLM.dml
+                        -nvargs X=/user/ml/X.mtx
+                                Y=/user/ml/Y.mtx
+                                B=/user/ml/B.mtx
+                                S=/user/ml/selected.csv
+                                O=/user/ml/stats.csv
+                                link=2
+                                yneg=-1.0
+                                icpt=2
+                                tol=0.000001
+                                moi=100
+                                mii=10
+                                thr=0.05
+                                fmt=csv
+</code></pre>
+  </div>
+<div data-lang="Spark">
+  <pre><code>$SPARK_HOME/bin/spark-submit --master yarn
+                             --deploy-mode cluster
+                             --conf spark.driver.maxResultSize=0
+                             SystemML.jar
+                             -f StepGLM.dml
+                             -config SystemML-config.xml
+                             -exec hybrid_spark
+                             -nvargs X=/user/ml/X.mtx
+                                     Y=/user/ml/Y.mtx
+                                     B=/user/ml/B.mtx
+                                     S=/user/ml/selected.csv
+                                     O=/user/ml/stats.csv
+                                     link=2
+                                     yneg=-1.0
+                                     icpt=2
+                                     tol=0.000001
+                                     moi=100
+                                     mii=10
+                                     thr=0.05
+                                     fmt=csv
+</code></pre>
+  </div>
+</div>
+
+<h3 id="details-3">Details</h3>
+
+<p>Similar to <code>StepLinearRegDS.dml</code>, our stepwise GLM script
+builds a model by iteratively selecting predictive variables using a
+forward selection strategy based on the AIC (5). Note that
+currently only the Bernoulli distribution family (<code>dfam=2</code> in
+<a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>) together with the following link functions
+are supported: <code>log</code>, <code>logit</code>, <code>probit</code>, and <code>cloglog</code> (link
+<script type="math/tex">\in\{1,2,3,4\}</script> in <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>).</p>
+
+<h3 id="returns-3">Returns</h3>
+
+<p>Similar to the outputs from <code>GLM.dml</code>, the stepwise GLM script
+computes the estimated regression coefficients and stores them in matrix
+$B$ on HDFS; matrix $B$ follows the same format as the one produced by
+<code>GLM.dml</code> (see <a href="algorithms-regression.html#generalized-linear-models">Generalized Linear Models</a>). Additionally,
+<code>StepGLM.dml</code> outputs the variable indices (stored in the
+1-column matrix $S$) in the order they have been selected by the
+algorithm, i.e., the $i$th entry in matrix $S$ stores the variable which
+improves the AIC the most in the $i$th iteration. If the model with the
+lowest AIC includes no variables, matrix $S$ will be empty. Moreover, the
+estimated summary statistics as defined in <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a> are
+printed out or stored in a file on HDFS (if requested); these statistics
+will be provided only if the selected model is nonempty, i.e., contains
+at least one variable.</p>
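+<p>The forward-selection loop described in the Details section above can be summarized by the following
+schematic numpy sketch. It is an illustration only, not <code>StepGLM.dml</code>: it fits Bernoulli/logit
+models with a plain Newton loop, always includes an intercept, and mirrors the <code>thr</code> stopping rule:</p>
+
+<pre><code># Schematic forward selection by AIC (AIC = 2k - 2 * log-likelihood).
+import numpy as np
+
+def fit_logistic(X, y, iters=25):
+    """Return (beta, log_likelihood) for an intercept-plus-X logistic model."""
+    Z = np.column_stack([np.ones(len(y)), X])
+    beta = np.zeros(Z.shape[1])
+    for _ in range(iters):                       # Newton / IRLS updates
+        mu = 1.0 / (1.0 + np.exp(-(Z @ beta)))
+        w = mu * (1.0 - mu) + 1e-10
+        beta += np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (y - mu))
+    mu = np.clip(1.0 / (1.0 + np.exp(-(Z @ beta))), 1e-12, 1 - 1e-12)
+    return beta, np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))
+
+def step_glm(X, y, thr=0.01):
+    """Greedily add the feature that lowers the AIC most; stop when gain < thr."""
+    selected, remaining = [], list(range(X.shape[1]))
+    best_aic = 2 * 1 - 2 * fit_logistic(X[:, []], y)[1]   # intercept-only model
+    while remaining:
+        aics = {j: 2 * (len(selected) + 2) - 2 * fit_logistic(X[:, selected + [j]], y)[1]
+                for j in remaining}
+        j_best = min(aics, key=aics.get)
+        if best_aic - aics[j_best] < thr:        # AIC decrease below threshold
+            break
+        best_aic = aics[j_best]
+        selected.append(j_best)                  # analogue of the matrix S ordering
+        remaining.remove(j_best)
+    return selected, best_aic
+
+rng = np.random.default_rng(1)
+X = rng.normal(size=(200, 5))
+y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - X[:, 2])))).astype(float)
+print(step_glm(X, y, thr=0.05))   # likely to select columns 0 and 2 first
+</code></pre>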
+<hr />
+
+<h2 id="regression-scoring-and-prediction">4.5. Regression Scoring and Prediction</h2>
+
+<h3 id="description-4">Description</h3>
+
+<p>Script <code>GLM-predict.dml</code> is intended to cover all linear
+model based regressions, including linear regression, binomial and
+multinomial logistic regression, and GLM regressions (Poisson, gamma,
+binomial with probit link, etc.). Having just one scoring script for all
+these regressions simplifies maintenance and enhancement while ensuring
+compatible interpretations for output statistics.</p>
+
+<p>The script performs two functions: prediction and scoring. To perform
+prediction, the script takes two matrix inputs: a collection of records
+$X$ (without the response attribute) and the estimated regression
+parameters $B$, also known as $\beta$. To perform scoring, in addition
+to $X$ and $B$, the script takes the matrix of actual response
+values $Y$ that are compared to the predictions made with $X$ and $B$.
+Of course there are other, non-matrix, input arguments that specify the
+model and the output format; see below for the full list.</p>
+
+<p>We assume that our test/scoring dataset is given by an
+$n\,{\times}\,m$ matrix $X$ of numerical feature vectors, where each
+row $x_i$ represents one feature vector of one record; we have $\dim x_i = m$. Each
+record also includes the response variable $y_i$ that may be numerical,
+single-label categorical, or multi-label categorical. A single-label
+categorical $y_i$ is an integer category label, one label per record; a
+multi-label $y_i$ is a vector of integer counts, one count for each
+possible label, which represents multiple single-label events
+(observations) for the same $x_i$. Internally we convert single-label
[... 939 lines stripped ...]
