Added: systemml/site/docs/1.1.0/algorithms-regression.html URL: http://svn.apache.org/viewvc/systemml/site/docs/1.1.0/algorithms-regression.html?rev=1828046&view=auto ============================================================================== --- systemml/site/docs/1.1.0/algorithms-regression.html (added) +++ systemml/site/docs/1.1.0/algorithms-regression.html Fri Mar 30 04:31:05 2018 @@ -0,0 +1,2759 @@ +<!DOCTYPE html> +<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--> +<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]--> +<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]--> +<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]--> + <head> + <title>SystemML Algorithms Reference - Regression - SystemML 1.1.0</title> + <meta charset="utf-8"> + <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> + + <meta name="viewport" content="width=device-width"> + <link rel="stylesheet" href="css/bootstrap.min.css"> + <link rel="stylesheet" href="css/main.css"> + <link rel="stylesheet" href="css/pygments-default.css"> + <link rel="shortcut icon" href="img/favicon.png"> + </head> + <body> + <!--[if lt IE 7]> + <p class="chromeframe">You are using an outdated browser. <a href="http://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p> + <![endif]--> + + <header class="navbar navbar-default navbar-fixed-top" id="topbar"> + <div class="container"> + <div class="navbar-header"> + <div class="navbar-brand brand projectlogo"> + <a href="http://systemml.apache.org/"><img class="logo" src="img/systemml-logo.png" alt="Apache SystemML" title="Apache SystemML"/></a> + </div> + <div class="navbar-brand brand projecttitle"> + <a href="http://systemml.apache.org/">Apache SystemML<sup id="trademark">â¢</sup></a><br/> + <span class="version">1.1.0</span> + </div> + <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target=".navbar-collapse"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + </div> + <nav class="navbar-collapse collapse"> + <ul class="nav navbar-nav navbar-right"> + <li><a href="index.html">Overview</a></li> + <li><a href="https://github.com/apache/systemml">GitHub</a></li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown">Documentation<b class="caret"></b></a> + <ul class="dropdown-menu" role="menu"> + <li><b>Running SystemML:</b></li> + <li><a href="https://github.com/apache/systemml">SystemML GitHub README</a></li> + <li><a href="spark-mlcontext-programming-guide.html">Spark MLContext</a></li> + <li><a href="spark-batch-mode.html">Spark Batch Mode</a> + <li><a href="hadoop-batch-mode.html">Hadoop Batch Mode</a> + <li><a href="standalone-guide.html">Standalone Guide</a></li> + <li><a href="jmlc.html">Java Machine Learning Connector (JMLC)</a> + <li class="divider"></li> + <li><b>Language Guides:</b></li> + <li><a href="dml-language-reference.html">DML Language Reference</a></li> + <li><a href="beginners-guide-to-dml-and-pydml.html">Beginner's Guide to DML and PyDML</a></li> + <li><a href="beginners-guide-python.html">Beginner's Guide for Python Users</a></li> + <li><a href="python-reference.html">Reference Guide for Python Users</a></li> + <li class="divider"></li> + <li><b>ML Algorithms:</b></li> + <li><a href="algorithms-reference.html">Algorithms 
Reference</a></li> + <li class="divider"></li> + <li><b>Tools:</b></li> + <li><a href="debugger-guide.html">Debugger Guide</a></li> + <li><a href="developer-tools-systemml.html">IDE Guide</a></li> + <li class="divider"></li> + <li><b>Other:</b></li> + <li><a href="contributing-to-systemml.html">Contributing to SystemML</a></li> + <li><a href="engine-dev-guide.html">Engine Developer Guide</a></li> + <li><a href="troubleshooting-guide.html">Troubleshooting Guide</a></li> + <li><a href="release-process.html">Release Process</a></li> + </ul> + </li> + + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a> + <ul class="dropdown-menu" role="menu"> + <li><a href="./api/java/index.html">Java</a></li> + <li><a href="./api/python/index.html">Python</a></li> + </ul> + </li> + + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown">Issues<b class="caret"></b></a> + <ul class="dropdown-menu" role="menu"> + <li><b>JIRA:</b></li> + <li><a href="https://issues.apache.org/jira/browse/SYSTEMML">SystemML JIRA</a></li> + + </ul> + </li> + </ul> + </nav> + </div> + </header> + + <div class="container" id="content"> + + <h1 class="title"><a href="algorithms-reference.html">SystemML Algorithms Reference</a></h1> + + + <!-- + +--> + +<h1 id="regression">4. Regression</h1> + +<h2 id="linear-regression">4.1. Linear Regression</h2> + +<h3 id="description">Description</h3> + +<p>Linear Regression scripts are used to model the relationship between one +numerical response variable and one or more explanatory (feature) +variables. The scripts are given a dataset $(X, Y) = (x_i, y_i)_{i=1}^n$ +where $x_i$ is a numerical vector of feature variables and $y_i$ is a +numerical response value for each training data record. The feature +vectors are provided as a matrix $X$ of size $n\,{\times}\,m$, where $n$ +is the number of records and $m$ is the number of features. The observed +response values are provided as a 1-column matrix $Y$, with a numerical +value $y_i$ for each $x_i$ in the corresponding row of matrix $X$.</p> + +<p>In linear regression, we predict the distribution of the response $y_i$ +based on a fixed linear combination of the features in $x_i$. We assume +that there exist constant regression coefficients +$\beta_0, \beta_1, \ldots, \beta_m$ and a constant residual +variance $\sigma^2$ such that</p> + +<script type="math/tex; mode=display">\begin{equation} +y_i \sim Normal(\mu_i, \sigma^2) \,\,\,\,\textrm{where}\,\,\,\, +\mu_i \,=\, \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} +\end{equation}</script> + +<p>Distribution +$y_i \sim Normal(\mu_i, \sigma^2)$ +models the “unexplained” residual noise and is assumed independent +across different records.</p> + +<p>The goal is to estimate the regression coefficients and the residual +variance. Once they are accurately estimated, we can make predictions +about $y_i$ given $x_i$ in new records. We can also use the $\beta_j$âs +to analyze the influence of individual features on the response value, +and assess the quality of this model by comparing residual variance in +the response, left after prediction, with its total variance.</p> + +<p>There are two scripts in our library, both doing the same estimation, +but using different computational methods. Depending on the size and the +sparsity of the feature matrix $X$, one or the other script may be more +efficient. 
The “direct solve” script <code>LinearRegDS.dml</code> is more +efficient when the number of features $m$ is relatively small +($m \sim 1000$ or less) and matrix $X$ is either tall or fairly dense +(has ${\gg}:m^2$ nonzeros); otherwise, the “conjugate gradient” script +<code>LinearRegCG.dml</code> is more efficient. If $m > 50000$, use only +<code>LinearRegCG.dml</code>.</p> + +<h3 id="usage">Usage</h3> + +<p><strong>Linear Regression - Direct Solve</strong>:</p> + +<div class="codetabs"> +<div data-lang="Python"> + + <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span> +<span class="c"># C = 1/reg (to disable regularization, use float("inf"))</span> +<span class="n">lr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">sqlCtx</span><span class="p">,</span> <span class="n">fit_intercept</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">normalize</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="nb">float</span><span class="p">(</span><span class="s">"inf"</span><span class="p">),</span> <span class="n">solver</span><span class="o">=</span><span class="s">'direct-solve'</span><span class="p">)</span> +<span class="c"># X_train, y_train and X_test can be NumPy matrices or Pandas DataFrame or SciPy Sparse Matrix</span> +<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span> +<span class="c"># df_train is DataFrame that contains two columns: "features" (of type Vector) and "label". 
df_test is a DataFrame that contains the column "features"</span> +<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df_train</span><span class="p">)</span></code></pre></div> + + </div> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f LinearRegDS.dml + -nvargs X=<file> + Y=<file> + B=<file> + O=[file] + icpt=[int] + reg=[double] + fmt=[format] +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f LinearRegDS.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=<file> + Y=<file> + B=<file> + O=[file] + icpt=[int] + reg=[double] + fmt=[format] +</code></pre> + </div> +</div> + +<p><strong>Linear Regression - Conjugate Gradient</strong>:</p> + +<div class="codetabs"> +<div data-lang="Python"> + + <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span> +<span class="c"># C = 1/reg (to disable regularization, use float("inf"))</span> +<span class="n">lr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">sqlCtx</span><span class="p">,</span> <span class="n">fit_intercept</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">normalize</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">tol</span><span class="o">=</span><span class="mf">0.000001</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="nb">float</span><span class="p">(</span><span class="s">"inf"</span><span class="p">),</span> <span class="n">solver</span><span class="o">=</span><span class="s">'newton-cg'</span><span class="p">)</span> +<span class="c"># X_train, y_train and X_test can be NumPy matrices or Pandas DataFrames or SciPy Sparse matrices</span> +<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span> +<span class="c"># df_train is DataFrame that contains two columns: "features" (of type Vector) and "label". 
df_test is a DataFrame that contains the column "features"</span> +<span class="n">y_test</span> <span class="o">=</span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df_train</span><span class="p">)</span></code></pre></div> + + </div> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f LinearRegCG.dml + -nvargs X=<file> + Y=<file> + B=<file> + O=[file] + Log=[file] + icpt=[int] + reg=[double] + tol=[double] + maxi=[int] + fmt=[format] +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f LinearRegCG.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=<file> + Y=<file> + B=<file> + O=[file] + Log=[file] + icpt=[int] + reg=[double] + tol=[double] + maxi=[int] + fmt=[format] +</code></pre> + </div> +</div> + +<h3 id="arguments-for-spark-and-hadoop-invocation">Arguments for Spark and Hadoop invocation</h3> + +<p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors, each row +constitutes one feature vector</p> + +<p><strong>Y</strong>: Location to read the 1-column matrix of response values</p> + +<p><strong>B</strong>: Location to store the estimated regression parameters (the $\beta_j$âs), +with the intercept parameter $\beta_0$ at position +B[$m\,{+}\,1$, 1] if available</p> + +<p><strong>O</strong>: (default: <code>" "</code>) Location to store the CSV-file of summary statistics defined +in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>, the default is to print it to the +standard output</p> + +<p><strong>Log</strong>: (default: <code>" "</code>, <code>LinearRegCG.dml</code> only) Location to store +iteration-specific variables for monitoring and debugging purposes, see +<a href="algorithms-regression.html#table8"><strong>Table 8</strong></a> +for details.</p> + +<p><strong>icpt</strong>: (default: <code>0</code>) Intercept presence and shifting/rescaling the features +in $X$:</p> + +<ul> + <li>0 = no intercept (hence no $\beta_0$), no shifting or +rescaling of the features</li> + <li>1 = add intercept, but do not shift/rescale the features +in $X$</li> + <li>2 = add intercept, shift/rescale the features in $X$ to +mean 0, variance 1</li> +</ul> + +<p><strong>reg</strong>: (default: <code>0.000001</code>) L2-regularization parameter $\lambda\geq 0$; set to nonzero for highly +dependent, sparse, or numerous ($m \gtrsim n/10$) features</p> + +<p><strong>tol</strong>: (default: <code>0.000001</code>, <code>LinearRegCG.dml</code> only) Tolerance $\varepsilon\geq 0$ used in the +convergence criterion: we terminate conjugate gradient iterations when +the $\beta$-residual reduces in L2-norm by this factor</p> + +<p><strong>maxi</strong>: (default: <code>0</code>, <code>LinearRegCG.dml</code> only) Maximum number of conjugate +gradient iterations, or <code>0</code> if no maximum limit provided</p> + +<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>, +<code>mm</code>, or <code>csv</code>; see read/write functions in +SystemML Language Reference for details.</p> + +<p>Please see <a href="https://apache.github.io/systemml/python-reference#mllearn-api">mllearn documentation</a> for +more details on the Python API.</p> + +<h3 id="examples">Examples</h3> + +<p><strong>Linear Regression - Direct Solve</strong>:</p> + +<div class="codetabs"> +<div data-lang="Python"> + 
+ <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> +<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span> +<span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span> +<span class="c"># Load the diabetes dataset</span> +<span class="n">diabetes</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_diabetes</span><span class="p">()</span> +<span class="c"># Use only one feature</span> +<span class="n">diabetes_X</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">data</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span> +<span class="c"># Split the data into training/testing sets</span> +<span class="n">diabetes_X_train</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> +<span class="n">diabetes_X_test</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span> +<span class="c"># Split the targets into training/testing sets</span> +<span class="n">diabetes_y_train</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> +<span class="n">diabetes_y_test</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span> +<span class="c"># Create linear regression object</span> +<span class="n">regr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="n">solver</span><span class="o">=</span><span class="s">'direct-solve'</span><span class="p">)</span> +<span class="c"># Train the model using the training sets</span> +<span class="n">regr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">diabetes_X_train</span><span class="p">,</span> <span class="n">diabetes_y_train</span><span class="p">)</span> +<span class="c"># The mean square error</span> +<span class="k">print</span><span class="p">(</span><span class="s">"Residual sum of squares: </span><span class="si">%.2</span><span class="s">f"</span> <span class="o">%</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">((</span><span class="n">regr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">diabetes_X_test</span><span class="p">)</span> <span class="o">-</span> <span class="n">diabetes_y_test</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span></code></pre></div> + + </div> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f LinearRegDS.dml + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + 
B=/user/ml/B.mtx + fmt=csv + O=/user/ml/stats.csv + icpt=2 + reg=1.0 +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f LinearRegDS.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + fmt=csv + O=/user/ml/stats.csv + icpt=2 + reg=1.0 +</code></pre> + </div> +</div> + +<p><strong>Linear Regression - Conjugate Gradient</strong>:</p> + +<div class="codetabs"> +<div data-lang="Python"> + + <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> +<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span> +<span class="kn">from</span> <span class="nn">systemml.mllearn</span> <span class="kn">import</span> <span class="n">LinearRegression</span> +<span class="c"># Load the diabetes dataset</span> +<span class="n">diabetes</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_diabetes</span><span class="p">()</span> +<span class="c"># Use only one feature</span> +<span class="n">diabetes_X</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">data</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span> +<span class="c"># Split the data into training/testing sets</span> +<span class="n">diabetes_X_train</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> +<span class="n">diabetes_X_test</span> <span class="o">=</span> <span class="n">diabetes_X</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span> +<span class="c"># Split the targets into training/testing sets</span> +<span class="n">diabetes_y_train</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[:</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> +<span class="n">diabetes_y_test</span> <span class="o">=</span> <span class="n">diabetes</span><span class="o">.</span><span class="n">target</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">:]</span> +<span class="c"># Create linear regression object</span> +<span class="n">regr</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="n">solver</span><span class="o">=</span><span class="s">'newton-cg'</span><span class="p">)</span> +<span class="c"># Train the model using the training sets</span> +<span class="n">regr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">diabetes_X_train</span><span class="p">,</span> <span class="n">diabetes_y_train</span><span class="p">)</span> +<span class="c"># The mean square error</span> +<span class="k">print</span><span class="p">(</span><span class="s">"Residual sum of squares: </span><span class="si">%.2</span><span class="s">f"</span> <span class="o">%</span> <span 
class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">((</span><span class="n">regr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">diabetes_X_test</span><span class="p">)</span> <span class="o">-</span> <span class="n">diabetes_y_test</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span></code></pre></div> + + </div> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f LinearRegCG.dml + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + fmt=csv + O=/user/ml/stats.csv + icpt=2 + reg=1.0 + tol=0.00000001 + maxi=100 + Log=/user/ml/log.csv +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f LinearRegCG.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + fmt=csv + O=/user/ml/stats.csv + icpt=2 + reg=1.0 + tol=0.00000001 + maxi=100 + Log=/user/ml/log.csv +</code></pre> + </div> +</div> + +<hr /> + +<p><a name="table7"></a> +<strong>Table 7</strong>: Besides $\beta$, linear regression scripts compute a few summary statistics +listed below. The statistics are provided in CSV format, one comma-separated name-value +pair per each line.</p> + +<table> + <thead> + <tr> + <th>Name</th> + <th>Meaning</th> + </tr> + </thead> + <tbody> + <tr> + <td>AVG_TOT_Y</td> + <td>Average of the response value $Y$</td> + </tr> + <tr> + <td>STDEV_TOT_Y</td> + <td>Standard Deviation of the response value $Y$</td> + </tr> + <tr> + <td>AVG_RES_Y</td> + <td>Average of the residual $Y - \mathop{\mathrm{pred}}(Y \mid X)$, i.e. residual bias</td> + </tr> + <tr> + <td>STDEV_RES_Y</td> + <td>Standard Deviation of the residual $Y - \mathop{\mathrm{pred}}(Y \mid X)$</td> + </tr> + <tr> + <td>DISPERSION</td> + <td>GLM-style dispersion, i.e. residual sum of squares / #deg. fr.</td> + </tr> + <tr> + <td>R2</td> + <td>$R^2$ of residual with bias included vs. total average</td> + </tr> + <tr> + <td>ADJUSTED_R2</td> + <td>Adjusted $R^2$ of residual with bias included vs. total average</td> + </tr> + <tr> + <td>R2_NOBIAS</td> + <td>Plain $R^2$ of residual with bias subtracted vs. total average</td> + </tr> + <tr> + <td>ADJUSTED_R2_NOBIAS</td> + <td>Adjusted $R^2$ of residual with bias subtracted vs. total average</td> + </tr> + <tr> + <td>R2_VS_0</td> + <td>* $R^2$ of residual with bias included vs. zero constant</td> + </tr> + <tr> + <td>ADJUSTED_R2_VS_0</td> + <td>* Adjusted $R^2$ of residual with bias included vs. zero constant</td> + </tr> + </tbody> +</table> + +<p>* The last two statistics are only printed if there is no intercept (<code>icpt=0</code>)</p> + +<hr /> + +<p><a name="table8"></a> +<strong>Table 8</strong>: The <code>Log</code> file for the <code>LinearRegCG.dml</code> script +contains the above iteration variables in CSV format, each line +containing a triple (Name, Iteration#, Value) with Iteration# being 0 +for initial values.</p> + +<table> + <thead> + <tr> + <th>Name</th> + <th>Meaning</th> + </tr> + </thead> + <tbody> + <tr> + <td>CG_RESIDUAL_NORM</td> + <td>L2-norm of conjug. grad. residual, which is $A$ %*% $\beta - t(X)$ %*% $y$ where $A = t(X)$ %*% $X + diag(\lambda)$, or a similar quantity</td> + </tr> + <tr> + <td>CG_RESIDUAL_RATIO</td> + <td>Ratio of current L2-norm of conjug. grad. 
residual over the initial</td> + </tr> + </tbody> +</table> + +<hr /> + +<h3 id="details">Details</h3> + +<p>To solve a linear regression problem over feature matrix $X$ and +response vector $Y$, we can find coefficients +$\beta_0, \beta_1, \ldots, \beta_m$ and $\sigma^2$ that maximize the +joint likelihood of all $y_i$ for $i=1\ldots n$, defined by the assumed +statistical model (1). Since the joint likelihood of the +independent +$y_i \sim Normal(\mu_i, \sigma^2)$ +is proportional to the product of +$\exp\big({-}\,(y_i - \mu_i)^2 / (2\sigma^2)\big)$, we can take the +logarithm of this product, then multiply by $-2\sigma^2 < 0$ to obtain a +least squares problem:</p> + +<script type="math/tex; mode=display">\begin{equation} +\sum_{i=1}^n \, (y_i - \mu_i)^2 \,\,=\,\, +\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2 +\,\,\to\,\,\min +\end{equation}</script> + +<p>This may not be enough, however. The minimum may +sometimes be attained over infinitely many $\beta$-vectors, for example +if $X$ has an all-0 column, or has linearly dependent columns, or has +fewer rows than columns . Even if (2) has a unique +solution, other $\beta$-vectors may be just a little suboptimal<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, yet +give significantly different predictions for new feature vectors. This +results in <em>overfitting</em>: prediction error for the training data ($X$ +and $Y$) is much smaller than for the test data (new records).</p> + +<p>Overfitting and degeneracy in the data is commonly mitigated by adding a +regularization penalty term to the least squares function:</p> + +<script type="math/tex; mode=display">\begin{equation} +\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2 +\,+\,\, \lambda \sum_{j=1}^m \beta_j^2 +\,\,\to\,\,\min +\end{equation}</script> + +<p>The choice of $\lambda>0$, the regularization +constant, typically involves cross-validation where the dataset is +repeatedly split into a training part (to estimate the $\beta_j$âs) and +a test part (to evaluate prediction accuracy), with the goal of +maximizing the test accuracy. In our scripts, $\lambda$ is provided as +input parameter <code>reg</code>.</p> + +<p>The solution to the least squares problem (3), through +taking the derivative and setting it to 0, has the matrix linear +equation form</p> + +<script type="math/tex; mode=display">\begin{equation} +A\left[\textstyle\beta_{1:m}\atop\textstyle\beta_0\right] \,=\, \big[X,\,1\big]^T Y,\,\,\, +\textrm{where}\,\,\, +A \,=\, \big[X,\,1\big]^T \big[X,\,1\big]\,+\,\hspace{0.5pt} diag(\hspace{0.5pt} +\underbrace{\lambda,\ldots, \lambda}_{\scriptstyle m}, 0) +\end{equation}</script> + +<p>where $[X,\,1]$ is $X$ with an extra column of 1s +appended on the right, and the diagonal matrix of $\lambda$âs has a zero +to keep the intercept $\beta_0$ unregularized. If the intercept is +disabled by setting $icpt=0$, the equation is simply $X^T X \beta = X^T Y$.</p> + +<p>We implemented two scripts for solving equation (4): one +is a “direct solver” that computes $A$ and then solves +$A\beta = [X,\,1]^T Y$ by calling an external package, the other +performs linear conjugate gradient (CG) iterations without ever +materializing $A$. The CG algorithm closely follows Algorithm 5.2 in +Chapter 5 of <a href="algorithms-bibliography.html">[Nocedal2006]</a>. Each step in the CG algorithm +computes a matrix-vector multiplication $q = Ap$ by first computing +$[X,\,1]\, p$ and then $[X,\,1]^T [X,\,1]\, p$. 
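+</p>
+
+<p>As a small illustration of equation (4), the following NumPy sketch shows both the
+"direct solve" computation and the matrix-free product $q = Ap$ used by the conjugate
+gradient script. It is only a simplified sketch under the conventions above (intercept
+column appended last and left unregularized), not the DML code of the scripts; the helper
+names <code>direct_solve</code> and <code>cg_matvec</code> are hypothetical.</p>
+
+<pre><code>import numpy as np
+
+def direct_solve(X, y, lam):
+    # Equation (4): append a column of ones, form A = [X,1]^T [X,1] + diag(lam,...,lam,0)
+    # and solve A * beta = [X,1]^T y (sketch of the "direct solve" approach).
+    n, m = X.shape
+    X1 = np.hstack([X, np.ones((n, 1))])                     # [X, 1]
+    A = X1.T @ X1 + lam * np.diag(np.r_[np.ones(m), 0.0])    # intercept entry unregularized
+    b = X1.T @ y
+    return np.linalg.solve(A, b)                             # beta_1..beta_m, then beta_0
+
+def cg_matvec(X1, p, lam, m):
+    # Matrix-free product q = A p used by the CG script: compute [X,1]^T ([X,1] p)
+    # plus the regularization term, without ever materializing A.
+    q = X1.T @ (X1 @ p)
+    q[:m] += lam * p[:m]                                     # do not regularize the intercept
+    return q
+</code></pre>
+
+<p>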
Usually the number of +such multiplications, one per CG iteration, is much smaller than $m$. +The user can put a hard bound on it with input +parameter <code>maxi</code>, or use the default maximum of $m+1$ (or $m$ +if no intercept) by having <code>maxi=0</code>. The CG iterations +terminate when the L2-norm of vector $r = A\beta - [X,\,1]^T Y$ +decreases from its initial value (for $\beta=0$) by the tolerance factor +specified in input parameter <code>tol</code>.</p> + +<p>The CG algorithm is more efficient if computing +$[X,\,1]^T \big([X,\,1]\, p\big)$ is much faster than materializing $A$, +an $(m\,{+}\,1)\times(m\,{+}\,1)$ matrix. The Direct Solver (DS) is more +efficient if $X$ takes up a lot more memory than $A$ (i.e. $X$ has a lot +more nonzeros than $m^2$) and if $m^2$ is small enough for the external +solver ($m \lesssim 50000$). A more precise determination between CG +and DS is subject to further research.</p> + +<p>In addition to the $\beta$-vector, the scripts estimate the residual +standard deviation $\sigma$ and the $R^2$, the ratio of “explained” +variance to the total variance of the response variable. These +statistics only make sense if the number of degrees of freedom +$n\,{-}\,m\,{-}\,1$ is positive and the regularization constant +$\lambda$ is negligible or zero. The formulas for $\sigma$ and +$R^2$ are:</p> + +<script type="math/tex; mode=display">R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},\quad +\sigma \,=\, \sqrt{\frac{\mathrm{RSS}}{n - m - 1}},\quad +R^2_{\textrm{adj.}} = 1 - \frac{\sigma^2 (n-1)}{\mathrm{TSS}}</script> + +<p>where</p> + +<script type="math/tex; mode=display">\mathrm{RSS} \,=\, \sum_{i=1}^n \Big(y_i - \hat{\mu}_i - +\frac{1}{n} \sum_{i'=1}^n \,(y_{i'} - \hat{\mu}_{i'})\Big)^2; \quad +\mathrm{TSS} \,=\, \sum_{i=1}^n \Big(y_i - \frac{1}{n} \sum_{i'=1}^n y_{i'}\Big)^2</script> + +<p>Here $\hat{\mu}_i$ are the predicted means for $y_i$ based on the +estimated regression coefficients and the feature vectors. They may be +biased when no intercept is present, hence the RSS formula subtracts the +bias.</p> + +<p>Lastly, note that by choosing the input option <code>icpt=2</code> the +user can shift and rescale the columns of $X$ to have zero average and +the variance of 1. This is particularly important when using +regularization over highly disbalanced features, because regularization +tends to penalize small-variance columns (which need large $\beta_j$âs) +more than large-variance columns (with small $\beta_j$âs). At the end, +the estimated regression coefficients are shifted and rescaled to apply +to the original features.</p> + +<h3 id="returns">Returns</h3> + +<p>The estimated regression coefficients (the $\hat{\beta}_j$âs) are +populated into a matrix and written to an HDFS file whose path/name was +provided as the <code>B</code> input argument. 
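+</p>
+
+<p>For example, a coefficient matrix written with <code>fmt=csv</code> and
+<code>icpt=1</code> (so that, as described for the <code>B</code> argument above, the
+intercept $\hat{\beta}_0$ occupies the last row) could be used to score new records
+roughly as sketched below; the file name and variable names are hypothetical and the
+snippet is not part of the scripts.</p>
+
+<pre><code>import numpy as np
+
+# Hypothetical illustration: load B (fmt=csv, icpt=1) and score new feature vectors.
+B = np.loadtxt("B.csv", delimiter=",").reshape(-1, 1)
+beta, beta0 = B[:-1, 0], B[-1, 0]          # beta_1..beta_m, then the intercept beta_0
+X_new = np.random.rand(5, beta.shape[0])   # stand-in for new records
+y_pred = X_new @ beta + beta0              # predicted means for the new records
+</code></pre>
+
+<p>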
What this matrix +contains, and its size, depends on the input argument <code>icpt</code>, +which specifies the userâs intercept and rescaling choice:</p> + +<ul> + <li><strong>icpt=0</strong>: No intercept, matrix $B$ has size $m\,{\times}\,1$, with +$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to $m = {}$ncol$(X)$.</li> + <li><strong>icpt=1</strong>: There is intercept, but no shifting/rescaling of $X$; matrix $B$ has +size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from +1 to $m$, and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept +coefficient.</li> + <li><strong>icpt=2</strong>: There is intercept, and the features in $X$ are shifted to mean$ = 0$ +and rescaled to variance$ = 1$; then there are two versions of +the $\hat{\beta}_j$âs, one for the original features and another for the +shifted/rescaled features. Now matrix $B$ has size +$(m\,{+}\,1) \times 2$, with $B[\cdot, 1]$ for the original features and +$B[\cdot, 2]$ for the shifted/rescaled features, in the above format. +Note that $B[\cdot, 2]$ are iteratively estimated and $B[\cdot, 1]$ are +obtained from $B[\cdot, 2]$ by complementary shifting and rescaling.</li> +</ul> + +<p>The estimated summary statistics, including residual standard +deviation $\sigma$ and the $R^2$, are printed out or sent into a file +(if specified) in CSV format as defined in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>. +For conjugate gradient iterations, a log file with monitoring variables +can also be made available, see <a href="algorithms-regression.html#table8"><strong>Table 8</strong></a>.</p> + +<hr /> + +<h2 id="stepwise-linear-regression">4.2. Stepwise Linear Regression</h2> + +<h3 id="description-1">Description</h3> + +<p>Our stepwise linear regression script selects a linear model based on +the Akaike information criterion (AIC): the model that gives rise to the +lowest AIC is computed.</p> + +<h3 id="usage-1">Usage</h3> + +<div class="codetabs"> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f StepLinearRegDS.dml + -nvargs X=<file> + Y=<file> + B=<file> + S=[file] + O=[file] + icpt=[int] + thr=[double] + fmt=[format] +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f StepLinearRegDS.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=<file> + Y=<file> + B=<file> + S=[file] + O=[file] + icpt=[int] + thr=[double] + fmt=[format] +</code></pre> + </div> +</div> + +<h3 id="arguments-for-spark-and-hadoop-invocation-1">Arguments for Spark and Hadoop invocation</h3> + +<p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors, each row +contains one feature vector.</p> + +<p><strong>Y</strong>: Location (on HDFS) to read the 1-column matrix of response values</p> + +<p><strong>B</strong>: Location (on HDFS) to store the estimated regression parameters (the +$\beta_j$âs), with the intercept parameter $\beta_0$ at position +B[$m\,{+}\,1$, 1] if available</p> + +<p><strong>S</strong>: (default: <code>" "</code>) Location (on HDFS) to store the selected feature-ids in the +order as computed by the algorithm; by default the selected feature-ids +are forwarded to the standard output.</p> + +<p><strong>O</strong>: (default: <code>" "</code>) Location (on HDFS) to store the CSV-file of summary +statistics defined in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a>; by default the 
+summary statistics are forwarded to the standard output.</p> + +<p><strong>icpt</strong>: (default: <code>0</code>) Intercept presence and shifting/rescaling the features +in $X$:</p> + +<ul> + <li>0 = no intercept (hence no $\beta_0$), no shifting or +rescaling of the features;</li> + <li>1 = add intercept, but do not shift/rescale the features +in $X$;</li> + <li>2 = add intercept, shift/rescale the features in $X$ to +mean 0, variance 1</li> +</ul> + +<p><strong>thr</strong>: (default: <code>0.01</code>) Threshold to stop the algorithm: if the decrease in the value +of the AIC falls below <code>thr</code> no further features are being +checked and the algorithm stops.</p> + +<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>, +<code>mm</code>, or <code>csv</code>; see read/write functions in +SystemML Language Reference for details.</p> + +<h3 id="examples-1">Examples</h3> + +<div class="codetabs"> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f StepLinearRegDS.dml + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + S=/user/ml/selected.csv + O=/user/ml/stats.csv + icpt=2 + thr=0.05 + fmt=csv +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f StepLinearRegDS.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + S=/user/ml/selected.csv + O=/user/ml/stats.csv + icpt=2 + thr=0.05 + fmt=csv +</code></pre> + </div> +</div> + +<h3 id="details-1">Details</h3> + +<p>Stepwise linear regression iteratively selects predictive variables in +an automated procedure. Currently, our implementation supports forward +selection: starting from an empty model (without any variable) the +algorithm examines the addition of each variable based on the AIC as a +model comparison criterion. The AIC is defined as</p> + +<script type="math/tex; mode=display">\begin{equation} +AIC = -2 \log{L} + 2 edf,\label{eq:AIC} +\end{equation}</script> + +<p>where $L$ denotes the +likelihood of the fitted model and $edf$ is the equivalent degrees of +freedom, i.e., the number of estimated parameters. This procedure is +repeated until including no additional variable improves the model by a +certain threshold specified in the input parameter <code>thr</code>.</p> + +<p>For fitting a model in each iteration we use the <code>direct solve</code> method +as in the script <code>LinearRegDS.dml</code> discussed in +<a href="algorithms-regression.html#linear-regression">Linear Regression</a>.</p> + +<h3 id="returns-1">Returns</h3> + +<p>Similar to the outputs from <code>LinearRegDS.dml</code> the stepwise +linear regression script computes the estimated regression coefficients +and stores them in matrix $B$ on HDFS. The format of matrix $B$ is +identical to the one produced by the scripts for linear regression (see +<a href="algorithms-regression.html#linear-regression">Linear Regression</a>). Additionally, <code>StepLinearRegDS.dml</code> +outputs the variable indices (stored in the 1-column matrix $S$) in the +order they have been selected by the algorithm, i.e., $i$th entry in +matrix $S$ corresponds to the variable which improves the AIC the most +in $i$th iteration. If the model with the lowest AIC includes no +variables matrix $S$ will be empty (contains one 0). 
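+</p>
+
+<p>A minimal sketch of the forward-selection procedure is given below for intuition,
+using the Gaussian log-likelihood so that, up to an additive constant,
+$AIC = n\log(\mathrm{RSS}/n) + 2\,edf$. It is purely illustrative and is not the
+<code>StepLinearRegDS.dml</code> implementation; the helper names are hypothetical and
+an intercept is always included.</p>
+
+<pre><code>import numpy as np
+
+def fit_rss(X, y, cols):
+    # Least-squares fit on the selected columns plus an intercept; returns the RSS.
+    Z = np.hstack([X[:, cols], np.ones((X.shape[0], 1))])
+    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
+    r = y - Z @ beta
+    return float(r @ r)
+
+def forward_select(X, y, thr=0.01):
+    # Forward selection by AIC (Gaussian case, constants dropped), stopping when the
+    # decrease in AIC falls below thr -- a sketch, not StepLinearRegDS.dml.
+    n, m = X.shape
+    selected, remaining = [], list(range(m))
+    best_aic = n * np.log(np.mean((y - y.mean()) ** 2)) + 2 * 1   # intercept-only model
+    while remaining:
+        aics = {j: n * np.log(fit_rss(X, y, selected + [j]) / n) + 2 * (len(selected) + 2)
+                for j in remaining}
+        j_best = min(aics, key=aics.get)
+        if best_aic - aics[j_best] < thr:      # improvement below threshold: stop
+            break
+        selected.append(j_best)
+        remaining.remove(j_best)
+        best_aic = aics[j_best]
+    return selected                            # feature indices in selection order
+</code></pre>
+
+<p>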
Moreover, the +estimated summary statistics as defined in <a href="algorithms-regression.html#table7"><strong>Table 7</strong></a> +are printed out or stored in a file (if requested). In the case where an +empty model achieves the best AIC these statistics will not be produced.</p> + +<hr /> + +<h2 id="generalized-linear-models">4.3. Generalized Linear Models</h2> + +<h3 id="description-2">Description</h3> + +<p>Generalized Linear Models +[<a href="algorithms-bibliography.html">Gill2000</a>, +<a href="algorithms-bibliography.html">McCullagh1989</a>, +<a href="algorithms-bibliography.html">Nelder1972</a>] +extend the methodology of linear +and logistic regression to a variety of distributions commonly assumed +as noise effects in the response variable. As before, we are given a +collection of records $(x_1, y_1)$, â¦, $(x_n, y_n)$ where $x_i$ is a +numerical vector of explanatory (feature) variables of size $\dim x_i = m$, and $y_i$ +is the response (dependent) variable observed for this vector. GLMs +assume that some linear combination of the features in $x_i$ determines +the <em>mean</em> $\mu_i$ of $y_i$, while the observed $y_i$ is a random +outcome of a noise distribution +$Prob[y\mid \mu_i]\,$<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> +with that mean $\mu_i$:</p> + +<script type="math/tex; mode=display">x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} +\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim Prob[y\mid \mu_i]</script> + +<p>In linear regression the response mean $\mu_i$ <em>equals</em> some linear +combination over $x_i$, denoted above by $\eta_i$. In logistic +regression with <script type="math/tex">y\in\{0, 1\}</script> (Bernoulli) the mean of $y$ is the same +as $Prob[y=1]$ +and equals $1/(1+e^{-\eta_i})$, the logistic function of $\eta_i$. In +GLM, $\mu_i$ and $\eta_i$ can be related via any given smooth monotone +function called the <em>link function</em>: $\eta_i = g(\mu_i)$. The unknown +linear combination parameters $\beta_j$ are assumed to be the same for +all records.</p> + +<p>The goal of the regression is to estimate the parameters $\beta_j$ from +the observed data. Once the $\beta_j$âs are accurately estimated, we can +make predictions about $y$ for a new feature vector $x$. To do so, +compute $\eta$ from $x$ and use the inverted link function +$\mu = g^{-1}(\eta)$ to compute the mean $\mu$ of $y$; then use the +distribution $Prob[y\mid \mu]$ +to make predictions about $y$. Both $g(\mu)$ and +$Prob[y\mid \mu]$ +are user-provided. 
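+</p>
+
+<p>For intuition, making predictions from a fitted GLM only requires the linear term and
+the inverse link $\mu = g^{-1}(\eta)$; a minimal sketch for a few common links follows.
+The function name and signature are hypothetical and the snippet is not part of
+<code>GLM.dml</code>.</p>
+
+<pre><code>import numpy as np
+
+def predict_mean(X_new, beta, beta0, link="log"):
+    # Illustrative only: compute eta_i = beta0 + x_i . beta and invert the link.
+    eta = X_new @ beta + beta0
+    if link == "log":          # eta = log(mu)        =>  mu = exp(eta)
+        return np.exp(eta)
+    if link == "logit":        # eta = log(mu/(1-mu)) =>  mu = 1/(1 + exp(-eta))
+        return 1.0 / (1.0 + np.exp(-eta))
+    if link == "identity":     # eta = mu
+        return eta
+    raise ValueError("link not covered by this sketch")
+</code></pre>
+
+<p>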
Our GLM script supports a standard set of +distributions and link functions, see below for details.</p> + +<h3 id="usage-2">Usage</h3> + +<div class="codetabs"> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f GLM.dml + -nvargs X=<file> + Y=<file> + B=<file> + fmt=[format] + O=[file] + Log=[file] + dfam=[int] + vpow=[double] + link=[int] + lpow=[double] + yneg=[double] + icpt=[int] + reg=[double] + tol=[double] + disp=[double] + moi=[int] + mii=[int] +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f GLM.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=<file> + Y=<file> + B=<file> + fmt=[format] + O=[file] + Log=[file] + dfam=[int] + vpow=[double] + link=[int] + lpow=[double] + yneg=[double] + icpt=[int] + reg=[double] + tol=[double] + disp=[double] + moi=[int] + mii=[int] +</code></pre> + </div> +</div> + +<h3 id="arguments-for-spark-and-hadoop-invocation-2">Arguments for Spark and Hadoop invocation</h3> + +<p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors; each row +constitutes an example.</p> + +<p><strong>Y</strong>: Location to read the response matrix, which may have 1 or 2 columns</p> + +<p><strong>B</strong>: Location to store the estimated regression parameters (the $\beta_j$âs), +with the intercept parameter $\beta_0$ at position +B[$m\,{+}\,1$, 1] if available</p> + +<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>, +<code>mm</code>, or <code>csv</code>; see read/write functions in +SystemML Language Reference for details.</p> + +<p><strong>O</strong>: (default: <code>" "</code>) Location to write certain summary statistics described +in <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a>, +by default it is standard output.</p> + +<p><strong>Log</strong>: (default: <code>" "</code>) Location to store iteration-specific variables for monitoring +and debugging purposes, see <a href="algorithms-regression.html#table10"><strong>Table 10</strong></a> for details.</p> + +<p><strong>dfam</strong>: (default: <code>1</code>) Distribution family code to specify +$Prob[y\mid \mu]$, +see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>:</p> + +<ul> + <li>1 = power distributions with $Var(y) = \mu^{\alpha}$</li> + <li>2 = binomial or Bernoulli</li> +</ul> + +<p><strong>vpow</strong>: (default: <code>0.0</code>) When <code>dfam=1</code>, this provides the $q$ in +$Var(y) = a\mu^q$, the power +dependence of the variance of $y$ on its mean. In particular, use:</p> + +<ul> + <li>0.0 = Gaussian</li> + <li>1.0 = Poisson</li> + <li>2.0 = Gamma</li> + <li>3.0 = inverse Gaussian</li> +</ul> + +<p><strong>link</strong>: (default: <code>0</code>) Link function code to determine the link +function $\eta = g(\mu)$:</p> + +<ul> + <li>0 = canonical link (depends on the distribution family), +see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a></li> + <li>1 = power functions</li> + <li>2 = logit</li> + <li>3 = probit</li> + <li>4 = cloglog</li> + <li>5 = cauchit</li> +</ul> + +<p><strong>lpow</strong>: (default: <code>1.0</code>) When link=1, this provides the $s$ in +$\eta = \mu^s$, the power link function; <code>lpow=0.0</code> gives the +log link $\eta = \log\mu$. 
Common power links:</p> + +<ul> + <li>-2.0 = $1/\mu^2$</li> + <li>-1.0 = reciprocal</li> + <li>0.0 = log</li> + <li>0.5 = sqrt</li> + <li>1.0 = identity</li> +</ul> + +<p><strong>yneg</strong>: (default: <code>0.0</code>) When <code>dfam=2</code> and the response matrix $Y$ has +1 column, this specifies the $y$-value used for Bernoulli “No” label +(“Yes” is $1$): +0.0 when $y\in\{0, 1\}$; -1.0 when +$y\in\{-1, 1\}$</p> + +<p><strong>icpt</strong>: (default: <code>0</code>) Intercept and shifting/rescaling of the features in $X$:</p> + +<ul> + <li>0 = no intercept (hence no $\beta_0$), no +shifting/rescaling of the features</li> + <li>1 = add intercept, but do not shift/rescale the features +in $X$</li> + <li>2 = add intercept, shift/rescale the features in $X$ to +mean 0, variance 1</li> +</ul> + +<p><strong>reg</strong>: (default: <code>0.0</code>) L2-regularization parameter ($\lambda$)</p> + +<p><strong>tol</strong>: (default: <code>0.000001</code>) Tolerance ($\varepsilon$) used in the convergence criterion: we +terminate the outer iterations when the deviance changes by less than +this factor; see below for details</p> + +<p><strong>disp</strong>: (default: <code>0.0</code>) Dispersion parameter, or 0.0 to estimate it from +data</p> + +<p><strong>moi</strong>: (default: <code>200</code>) Maximum number of outer (Fisher scoring) iterations</p> + +<p><strong>mii</strong>: (default: <code>0</code>) Maximum number of inner (conjugate gradient) iterations, or 0 +if no maximum limit provided</p> + +<h3 id="examples-2">Examples</h3> + +<div class="codetabs"> +<div data-lang="Hadoop"> + <pre><code>hadoop jar SystemML.jar -f GLM.dml + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + fmt=csv + dfam=2 + link=2 + yneg=-1.0 + icpt=2 + reg=0.01 + tol=0.00000001 + disp=1.0 + moi=100 + mii=10 + O=/user/ml/stats.csv + Log=/user/ml/log.csv +</code></pre> + </div> +<div data-lang="Spark"> + <pre><code>$SPARK_HOME/bin/spark-submit --master yarn + --deploy-mode cluster + --conf spark.driver.maxResultSize=0 + SystemML.jar + -f GLM.dml + -config SystemML-config.xml + -exec hybrid_spark + -nvargs X=/user/ml/X.mtx + Y=/user/ml/Y.mtx + B=/user/ml/B.mtx + fmt=csv + dfam=2 + link=2 + yneg=-1.0 + icpt=2 + reg=0.01 + tol=0.00000001 + disp=1.0 + moi=100 + mii=10 + O=/user/ml/stats.csv + Log=/user/ml/log.csv +</code></pre> + </div> +</div> + +<hr /> + +<p><a name="table9"></a> +<strong>Table 9</strong>: Besides $\beta$, GLM regression script computes a few summary +statistics listed below. 
They are provided in CSV format, one +comma-separated name-value pair per each line.</p> + +<table> + <thead> + <tr> + <th>Name</th> + <th>Meaning</th> + </tr> + </thead> + <tbody> + <tr> + <td>TERMINATION_CODE</td> + <td>A positive integer indicating success/failure as follows: <br />1 = Converged successfully; 2 = Maximum # of iterations reached; 3 = Input (X, Y) out of range; 4 = Distribution/link not supported</td> + </tr> + <tr> + <td>BETA_MIN</td> + <td>Smallest beta value (regression coefficient), excluding the intercept</td> + </tr> + <tr> + <td>BETA_MIN_INDEX</td> + <td>Column index for the smallest beta value</td> + </tr> + <tr> + <td>BETA_MAX</td> + <td>Largest beta value (regression coefficient), excluding the intercept</td> + </tr> + <tr> + <td>BETA_MAX_INDEX</td> + <td>Column index for the largest beta value</td> + </tr> + <tr> + <td>INTERCEPT</td> + <td>Intercept value, or NaN if there is no intercept (if <code>icpt=0</code>)</td> + </tr> + <tr> + <td>DISPERSION</td> + <td>Dispersion used to scale deviance, provided in disp input argument or estimated (same as DISPERSION_EST) if disp argument is $\leq 0$</td> + </tr> + <tr> + <td>DISPERSION_EST</td> + <td>Dispersion estimated from the dataset</td> + </tr> + <tr> + <td>DEVIANCE_UNSCALED</td> + <td>Deviance from the saturated model, assuming dispersion $= 1.0$</td> + </tr> + <tr> + <td>DEVIANCE_SCALED</td> + <td>Deviance from the saturated model, scaled by DISPERSION value</td> + </tr> + </tbody> +</table> + +<hr /> + +<p><a name="table10"></a> +<strong>Table 10</strong>: The Log file for GLM regression contains the below +iteration variables in CSV format, each line containing a triple (Name, +Iteration#, Value) with Iteration# being 0 for initial values.</p> + +<table> + <thead> + <tr> + <th>Name</th> + <th>Meaning</th> + </tr> + </thead> + <tbody> + <tr> + <td>NUM_CG_ITERS</td> + <td>Number of inner (Conj. Gradient) iterations in this outer iteration</td> + </tr> + <tr> + <td>IS_TRUST_REACHED</td> + <td>1 = trust region boundary was reached, 0 = otherwise</td> + </tr> + <tr> + <td>POINT_STEP_NORM</td> + <td>L2-norm of iteration step from old point ($\beta$-vector) to new point</td> + </tr> + <tr> + <td>OBJECTIVE</td> + <td>The loss function we minimize (negative partial log-likelihood)</td> + </tr> + <tr> + <td>OBJ_DROP_REAL</td> + <td>Reduction in the objective during this iteration, actual value</td> + </tr> + <tr> + <td>OBJ_DROP_PRED</td> + <td>Reduction in the objective predicted by a quadratic approximation</td> + </tr> + <tr> + <td>OBJ_DROP_RATIO</td> + <td>Actual-to-predicted reduction ratio, used to update the trust region</td> + </tr> + <tr> + <td>GRADIENT_NORM</td> + <td>L2-norm of the loss function gradient (omitted if point is rejected)</td> + </tr> + <tr> + <td>LINEAR_TERM_MIN</td> + <td>The minimum value of $X$ %*% $\beta$, used to check for overflows</td> + </tr> + <tr> + <td>LINEAR_TERM_MAX</td> + <td>The maximum value of $X$ %*% $\beta$, used to check for overflows</td> + </tr> + <tr> + <td>IS_POINT_UPDATED</td> + <td>1 = new point accepted; 0 = new point rejected, old point restored</td> + </tr> + <tr> + <td>TRUST_DELTA</td> + <td>Updated trust region size, the “delta”</td> + </tr> + </tbody> +</table> + +<hr /> + +<p><a name="table11"></a> +<strong>Table 11</strong>: Common GLM distribution families and link functions. 
(Here “*” stands +for “any value.”)</p> + +<table> + <thead> + <tr> + <th style="text-align: center">dfam</th> + <th style="text-align: center">vpow</th> + <th style="text-align: center">link</th> + <th style="text-align: center">lpow</th> + <th style="text-align: center">Distribution<br />Family</th> + <th style="text-align: center">Link<br /> Function</th> + <th style="text-align: center">Canonical</th> + </tr> + </thead> + <tbody> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">0.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center">-1.0</td> + <td style="text-align: center">Gaussian</td> + <td style="text-align: center">inverse</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">0.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.0</td> + <td style="text-align: center">Gaussian</td> + <td style="text-align: center">log</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">0.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 1.0</td> + <td style="text-align: center">Gaussian</td> + <td style="text-align: center">identity</td> + <td style="text-align: center">Yes</td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">1.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.0</td> + <td style="text-align: center">Poisson</td> + <td style="text-align: center">log</td> + <td style="text-align: center">Yes</td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">1.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.5</td> + <td style="text-align: center">Poisson</td> + <td style="text-align: center">sq.root</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">1.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 1.0</td> + <td style="text-align: center">Poisson</td> + <td style="text-align: center">identity</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">2.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center">-1.0</td> + <td style="text-align: center">Gamma</td> + <td style="text-align: center">inverse</td> + <td style="text-align: center">Yes</td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">2.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.0</td> + <td style="text-align: center">Gamma</td> + <td style="text-align: center">log</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">2.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 1.0</td> + <td style="text-align: center">Gamma</td> + <td style="text-align: center">identity</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">3.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center">-2.0</td> + <td style="text-align: center">Inverse Gauss</td> + <td style="text-align: center">$1/\mu^2$</td> + <td style="text-align: center">Yes</td> 
+ </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">3.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center">-1.0</td> + <td style="text-align: center">Inverse Gauss</td> + <td style="text-align: center">inverse</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">3.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.0</td> + <td style="text-align: center">Inverse Gauss</td> + <td style="text-align: center">log</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">1</td> + <td style="text-align: center">3.0</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 1.0</td> + <td style="text-align: center">Inverse Gauss</td> + <td style="text-align: center">identity</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">2</td> + <td style="text-align: center">*</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.0</td> + <td style="text-align: center">Binomial</td> + <td style="text-align: center">log</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">2</td> + <td style="text-align: center">*</td> + <td style="text-align: center">1</td> + <td style="text-align: center"> 0.5</td> + <td style="text-align: center">Binomial</td> + <td style="text-align: center">sq.root</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">2</td> + <td style="text-align: center">*</td> + <td style="text-align: center">2</td> + <td style="text-align: center"> *</td> + <td style="text-align: center">Binomial</td> + <td style="text-align: center">logit</td> + <td style="text-align: center">Yes</td> + </tr> + <tr> + <td style="text-align: center">2</td> + <td style="text-align: center">*</td> + <td style="text-align: center">3</td> + <td style="text-align: center"> *</td> + <td style="text-align: center">Binomial</td> + <td style="text-align: center">probit</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">2</td> + <td style="text-align: center">*</td> + <td style="text-align: center">4</td> + <td style="text-align: center"> *</td> + <td style="text-align: center">Binomial</td> + <td style="text-align: center">cloglog</td> + <td style="text-align: center"> </td> + </tr> + <tr> + <td style="text-align: center">2</td> + <td style="text-align: center">*</td> + <td style="text-align: center">5</td> + <td style="text-align: center"> *</td> + <td style="text-align: center">Binomial</td> + <td style="text-align: center">cauchit</td> + <td style="text-align: center"> </td> + </tr> + </tbody> +</table> + +<hr /> + +<p><a name="table12"></a> +<strong>Table 12</strong>: The supported non-power link functions for the Bernoulli and the +binomial distributions. 
+Here $\mu$ is the Bernoulli mean.</p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Name</th>
+      <th>Link Function</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Logit</td>
+      <td>$\displaystyle \mu = 1 / \big(1 + e^{-\eta}\big)^{\mathstrut}$</td>
+    </tr>
+    <tr>
+      <td>Probit</td>
+      <td><script type="math/tex">\displaystyle \mu = \frac{1}{\sqrt{2\pi}}\int\nolimits_{-\infty_{\mathstrut}}^{\,\eta\mathstrut} e^{-\frac{t^2}{2}} dt</script></td>
+    </tr>
+    <tr>
+      <td>Cloglog</td>
+      <td>$\displaystyle \eta = \log \big(- \log(1 - \mu)\big)^{\mathstrut}$</td>
+    </tr>
+    <tr>
+      <td>Cauchit</td>
+      <td>$\displaystyle \eta = \tan\big(\pi(\mu - 1/2)\big)$</td>
+    </tr>
+  </tbody>
+</table>
+
+<hr />
+
+<h3 id="details-2">Details</h3>
+
+<p>In GLM, the noise distribution
+$Prob[y\mid \mu]$
+of the response variable $y$ given its mean $\mu$ is restricted to have
+the <em>exponential family</em> form</p>
+
+<script type="math/tex; mode=display">\begin{equation}
+Y \sim\, Prob[y\mid \mu] \,=\, \exp\left(\frac{y\theta - b(\theta)}{a}
++ c(y, a)\right),\,\,\textrm{where}\,\,\,\mu = E(Y) = b'(\theta).
+\end{equation}</script>
+
+<p>Changing the mean in such a distribution simply
+multiplies all
+$Prob[y\mid \mu]$
+by $e^{\,y\hspace{0.2pt}\theta/a}$ and rescales them so
+that they again integrate to 1. Parameter $\theta$ is called
+<em>canonical</em>, and the function $\theta = b'^{\,-1}(\mu)$ that relates it
+to the mean is called the <em>canonical link</em>; constant $a$ is called
+<em>dispersion</em> and rescales the variance of $y$. Many common distributions
+can be put into this form; see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>. The canonical
+parameter $\theta$ is often chosen to coincide with $\eta$, the linear
+combination of the regression features; other choices for $\eta$ are
+possible too.</p>
+
+<p>Rather than specifying the canonical link, GLM distributions are
+commonly defined by their variance
+$Var(y)$ as a function of
+the mean $\mu$. It can be shown from Eq. 6 that
+$Var(y) = a\,b''(\theta) = a\,b''(b'^{\,-1}(\mu))$.
+For example, for the Bernoulli distribution
+$Var(y) = \mu(1-\mu)$, for the
+Poisson distribution
+$Var(y) = \mu$, and for the Gaussian distribution
+$Var(y) = a\cdot 1 = \sigma^2$.
+It turns out that for many common distributions
+$Var(y) = a\mu^q$, a power
+function. We support all distributions where
+$Var(y) = a\mu^q$, as well as
+the Bernoulli and the binomial distributions.</p>
+
+<p>For distributions with
+$Var(y) = a\mu^q$ the
+canonical link is also a power function, namely
+$\theta = \mu^{1-q}/(1-q)$, except for the Poisson ($q = 1$) whose
+canonical link is $\theta = \log\mu$. We support all power link
+functions in the form $\eta = \mu^s$, dropping any constant factor, with
+$\eta = \log\mu$ for $s=0$. The binomial distribution has its own family
+of link functions, which includes logit (the canonical link), probit,
+cloglog, and cauchit (see <a href="algorithms-regression.html#table12"><strong>Table 12</strong></a>); we support
+these only for the binomial and Bernoulli distributions. Links and
+distributions are specified via four input parameters:
+<code>dfam</code>, <code>vpow</code>, <code>link</code>, and
+<code>lpow</code> (see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>).</p>
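+<p>To make <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a> concrete, here is a small illustrative Python lookup
+(hypothetical helper code, not part of <code>GLM.dml</code>) that maps a few common models to the
+<code>dfam</code>, <code>vpow</code>, <code>link</code>, and <code>lpow</code> codes from the table; for the
+binomial family the table marks <code>vpow</code> and <code>lpow</code> with “*”, i.e. any value may be passed:</p>
+
+<pre><code># Illustrative lookup of (dfam, vpow, link, lpow) codes taken from Table 11.
+# The model names on the left are informal labels, not GLM.dml arguments.
+GLM_CONFIGS = {
+    "linear_regression":   dict(dfam=1, vpow=0.0, link=1, lpow=1.0),   # Gaussian, identity
+    "poisson_regression":  dict(dfam=1, vpow=1.0, link=1, lpow=0.0),   # Poisson, log
+    "gamma_regression":    dict(dfam=1, vpow=2.0, link=1, lpow=0.0),   # Gamma, log
+    "inverse_gaussian":    dict(dfam=1, vpow=3.0, link=1, lpow=-2.0),  # Inverse Gauss, canonical
+    "logistic_regression": dict(dfam=2, vpow=0.0, link=2, lpow=0.0),   # Binomial, logit (vpow/lpow ignored)
+    "probit_regression":   dict(dfam=2, vpow=0.0, link=3, lpow=0.0),   # Binomial, probit (vpow/lpow ignored)
+}
+
+def glm_nvargs(model):
+    """Format the Table 11 codes as key=value pairs for an -nvargs command line."""
+    return " ".join("{0}={1}".format(k, v) for k, v in GLM_CONFIGS[model].items())
+
+print(glm_nvargs("poisson_regression"))   # dfam=1 vpow=1.0 link=1 lpow=0.0
+</code></pre>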
+<p>The observed response values are provided to the regression script as a
+matrix $Y$ having 1 or 2 columns. If a power distribution family is
+selected (<code>dfam=1</code>), matrix $Y$ must have 1 column that
+provides $y_i$ for each $x_i$ in the corresponding row of matrix $X$.
+When <code>dfam=2</code> and $Y$ has 1 column, we assume the Bernoulli
+distribution for <script type="math/tex">y_i\in\{y_{\mathrm{neg}}, 1\}</script> with $y_{\mathrm{neg}}$
+from the input parameter <code>yneg</code>. When <code>dfam=2</code> and
+$Y$ has 2 columns, we assume the binomial distribution; for each row $i$
+in $X$, cells $Y[i, 1]$ and $Y[i, 2]$ provide the positive and the
+negative binomial counts, respectively. Internally we convert the
+1-column Bernoulli into the 2-column binomial with 0-versus-1 counts.</p>
+
+<p>We estimate the regression parameters via L2-regularized negative
+log-likelihood minimization:</p>
+
+<script type="math/tex; mode=display">f(\beta; X, Y) \,\,=\,\, -\sum\nolimits_{i=1}^n \big(y_i\theta_i - b(\theta_i)\big)
+\,+\,(\lambda/2) \sum\nolimits_{j=1}^m \beta_j^2\,\,\to\,\,\min</script>
+
+<p>where
+$\theta_i$ and $b(\theta_i)$ are from (6); note that $a$ and
+$c(y, a)$ are constant w.r.t. $\beta$ and can be ignored here. The
+canonical parameter $\theta_i$ depends on both $\beta$ and $x_i$:</p>
+
+<script type="math/tex; mode=display">\theta_i \,\,=\,\, b'^{\,-1}(\mu_i) \,\,=\,\, b'^{\,-1}\big(g^{-1}(\eta_i)\big) \,\,=\,\,
+\big(b'^{\,-1}\circ g^{-1}\big)\left(\beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}\right)</script>
+
+<p>The user-provided (via <code>reg</code>) regularization coefficient
+$\lambda\geq 0$ can be used to mitigate overfitting and degeneracy in
+the data. Note that the intercept is never regularized.</p>
+
+<p>Our iterative minimizer for $f(\beta; X, Y)$ uses the Fisher scoring
+approximation to the difference
+$\varDelta f(z; \beta) = f(\beta + z; X, Y) \,-\, f(\beta; X, Y)$,
+recomputed at each iteration:</p>
+
+<script type="math/tex; mode=display">\begin{gathered}
+\varDelta f(z; \beta) \,\,\,\approx\,\,\, 1/2 \cdot z^T A z \,+\, G^T z,
+\,\,\,\,\textrm{where}\,\,\,\, A \,=\, X^T\!diag(w) X \,+\, \lambda I\\
+\textrm{and}\,\,\,\,G \,=\, - X^T u \,+\, \lambda\beta,
+\,\,\,\textrm{with $n\,{\times}\,1$ vectors $w$ and $u$ given by}\\
+\forall\,i = 1\ldots n: \,\,\,\,
+w_i = \big[v(\mu_i)\,g'(\mu_i)^2\big]^{-1}
+\!\!\!\!\!\!,\,\,\,\,\,\,\,\,\,
+u_i = (y_i - \mu_i)\big[v(\mu_i)\,g'(\mu_i)\big]^{-1}
+\!\!\!\!\!\!.\,\,\,\,\end{gathered}</script>
+
+<p>Here
+$v(\mu_i)=Var(y_i)/a$, the
+variance of $y_i$ as a function of the mean, and
+$g'(\mu_i) = d \eta_i/d \mu_i$ is the link function derivative. The
+Fisher scoring approximation is minimized by trust-region conjugate
+gradient iterations (called the <em>inner</em> iterations, with the Fisher
+scoring iterations as the <em>outer</em> iterations), which approximately solve
+the following problem:</p>
+
+<script type="math/tex; mode=display">1/2 \cdot z^T A z \,+\, G^T z \,\,\to\,\,\min\,\,\,\,\textrm{subject to}\,\,\,\,
+\|z\|_2 \leq \delta</script>
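+<p>The following is a minimal numpy sketch, not the <code>GLM.dml</code> implementation: it specializes the
+$w$, $u$, $A$, and $G$ formulas above to the Poisson family with the log link
+(<code>dfam=1, vpow=1.0, link=1, lpow=0.0</code>), drops the intercept for brevity, and replaces the
+trust-region conjugate gradient inner loop by a direct solve of the quadratic model:</p>
+
+<pre><code># Sketch of one Fisher scoring (outer) step for Poisson regression with log link.
+import numpy as np
+
+def fisher_scoring_step(X, y, beta, lam):
+    eta = X @ beta                 # linear predictor eta_i
+    mu = np.exp(eta)               # inverse log link: mu_i = g^{-1}(eta_i)
+    v = mu                         # Poisson variance function v(mu) = mu
+    g_prime = 1.0 / mu             # g'(mu) = d eta / d mu for the log link
+    w = 1.0 / (v * g_prime ** 2)   # w_i = [v(mu_i) g'(mu_i)^2]^{-1}
+    u = (y - mu) / (v * g_prime)   # u_i = (y_i - mu_i) [v(mu_i) g'(mu_i)]^{-1}
+    A = X.T @ (w[:, None] * X) + lam * np.eye(X.shape[1])
+    G = -X.T @ u + lam * beta
+    z = np.linalg.solve(A, -G)     # GLM.dml instead minimizes 1/2 z'Az + G'z
+    return beta + z                # subject to a trust-region bound on z
+
+rng = np.random.default_rng(0)
+X = rng.normal(size=(100, 3))
+y = rng.poisson(np.exp(X @ np.array([0.4, -0.2, 0.1])))
+beta = np.zeros(3)
+for _ in range(10):
+    beta = fisher_scoring_step(X, y, beta, lam=0.0)
+print(beta)   # should be close to the generating coefficients (0.4, -0.2, 0.1)
+</code></pre>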
+<p>The conjugate gradient algorithm closely follows
+Algorithm 7.2 on page 171 of <a href="algorithms-bibliography.html">[Nocedal2006]</a>. The trust region
+size $\delta$ is initialized as
+$0.5\sqrt{m}\,/ \max\nolimits_i |x_i|_2$ and updated as described
+in <a href="algorithms-bibliography.html">[Nocedal2006]</a>.
+The user can specify the maximum number of
+the outer and the inner iterations with input parameters
+<code>moi</code> and <code>mii</code>, respectively. The Fisher scoring
+algorithm terminates successfully if
+$2|\varDelta f(z; \beta)| < (D_1(\beta) + 0.1)\hspace{0.5pt}{\varepsilon}$,
+where ${\varepsilon}> 0$ is a tolerance supplied by the user via
+<code>tol</code>, and $D_1(\beta)$ is the unit-dispersion deviance
+estimated as</p>
+
+<script type="math/tex; mode=display">D_1(\beta) \,\,=\,\, 2 \cdot \big(\log Prob[Y \mid \!
+\begin{smallmatrix}\textrm{saturated}\\\textrm{model}\end{smallmatrix}, a\,{=}\,1]
+\,\,-\,\,\log Prob[Y \mid X, \beta, a\,{=}\,1]\,\big)</script>
+
+<p>The deviance estimate is also produced as part of the output. Once the
+Fisher scoring algorithm terminates, if requested by the user, we
+estimate the dispersion $a$ from (6) using Pearson residuals</p>
+
+<script type="math/tex; mode=display">\begin{equation}
+\hat{a} \,\,=\,\, \frac{1}{n-m}\cdot \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{v(\mu_i)}
+\end{equation}</script>
+
+<p>and use it to adjust our deviance estimate:
+$D_{\hat{a}}(\beta) = D_1(\beta)/\hat{a}$. If input argument
+<code>disp</code> is <code>0.0</code> we estimate $\hat{a}$, otherwise
+we use its value as $a$. Note that in (7) $m$ counts
+the intercept ($m \leftarrow m+1$) if it is present.</p>
+
+<h3 id="returns-2">Returns</h3>
+
+<p>The estimated regression parameters (the $\hat{\beta}_j$'s) are
+populated into a matrix and written to an HDFS file whose path/name was
+provided as the <code>B</code> input argument. What this matrix
+contains, and its size, depends on the input argument <code>icpt</code>,
+which specifies the user's intercept and rescaling choice:</p>
+
+<ul>
+  <li><strong>icpt=0</strong>: No intercept; matrix $B$ has size $m\,{\times}\,1$, with
+$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to $m = {}$ncol$(X)$.</li>
+  <li><strong>icpt=1</strong>: There is an intercept, but no shifting/rescaling of $X$; matrix $B$ has
+size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from
+1 to $m$, and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept
+coefficient.</li>
+  <li><strong>icpt=2</strong>: There is an intercept, and the features in $X$ are shifted to mean${} = 0$
+and rescaled to variance${} = 1$; then there are two versions of
+the $\hat{\beta}_j$'s, one for the original features and another for the
+shifted/rescaled features. Now matrix $B$ has size
+$(m\,{+}\,1) \times 2$, with $B[\cdot, 1]$ for the original features and
+$B[\cdot, 2]$ for the shifted/rescaled features, in the above format.
+Note that $B[\cdot, 2]$ are iteratively estimated and $B[\cdot, 1]$ are
+obtained from $B[\cdot, 2]$ by complementary shifting and rescaling.</li>
+</ul>
+
+<p>Our script also estimates the dispersion $\hat{a}$ (or takes it from the
+user's input) and the deviances $D_1(\hat{\beta})$ and
+$D_{\hat{a}}(\hat{\beta})$; see <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a> for details. A
+log file with variables monitoring progress through the iterations can
+also be made available; see <a href="algorithms-regression.html#table10"><strong>Table 10</strong></a>.</p>
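+<p>As a minimal illustration of the <code>B</code> layout above (not a replacement for
+<code>GLM-predict.dml</code>, covered in Section 4.5 below), the following numpy sketch applies a matrix $B$
+written with <code>icpt=1</code> and <code>fmt=csv</code> to new data under a Bernoulli/logit model; the
+file names are placeholders:</p>
+
+<pre><code># Hypothetical scoring sketch: B[1..m] are feature coefficients, B[m+1] is the intercept.
+import numpy as np
+
+B = np.loadtxt("B.csv", delimiter=",").reshape(-1)   # (m+1) values, icpt=1 layout
+X_new = np.loadtxt("X_new.csv", delimiter=",")       # n x m matrix of new records
+
+beta, beta0 = B[:-1], B[-1]
+eta = X_new @ beta + beta0           # linear predictor eta_i
+mu = 1.0 / (1.0 + np.exp(-eta))      # inverse logit link: estimated Prob[y_i = 1]
+print(mu[:5])
+</code></pre>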
+<h3 id="see-also">See Also</h3>
+
+<p>In case of binary classification problems, consider using L2-SVM or
+binary logistic regression; for multiclass classification, use
+multiclass SVM or multinomial logistic regression. For the special cases
+of linear regression and logistic regression, it may be more efficient
+to use the corresponding specialized scripts instead of GLM.</p>
+
+<hr />
+
+<h2 id="stepwise-generalized-linear-regression">4.4. Stepwise Generalized Linear Regression</h2>
+
+<h3 id="description-3">Description</h3>
+
+<p>Our stepwise generalized linear regression script selects a model based
+on the Akaike information criterion (AIC): the model that attains the
+lowest AIC is returned. Note that currently only the Bernoulli
+distribution family is supported (see below for details).</p>
+
+<h3 id="usage-3">Usage</h3>
+
+<div class="codetabs">
+<div data-lang="Hadoop">
+  <pre><code>hadoop jar SystemML.jar -f StepGLM.dml
+                        -nvargs X=<file>
+                                Y=<file>
+                                B=<file>
+                                S=[file]
+                                O=[file]
+                                link=[int]
+                                yneg=[double]
+                                icpt=[int]
+                                tol=[double]
+                                disp=[double]
+                                moi=[int]
+                                mii=[int]
+                                thr=[double]
+                                fmt=[format]
+</code></pre>
+  </div>
+<div data-lang="Spark">
+  <pre><code>$SPARK_HOME/bin/spark-submit --master yarn
+                             --deploy-mode cluster
+                             --conf spark.driver.maxResultSize=0
+                             SystemML.jar
+                             -f StepGLM.dml
+                             -config SystemML-config.xml
+                             -exec hybrid_spark
+                             -nvargs X=<file>
+                                     Y=<file>
+                                     B=<file>
+                                     S=[file]
+                                     O=[file]
+                                     link=[int]
+                                     yneg=[double]
+                                     icpt=[int]
+                                     tol=[double]
+                                     disp=[double]
+                                     moi=[int]
+                                     mii=[int]
+                                     thr=[double]
+                                     fmt=[format]
+</code></pre>
+  </div>
+</div>
+
+<h3 id="arguments-for-spark-and-hadoop-invocation-3">Arguments for Spark and Hadoop invocation</h3>
+
+<p><strong>X</strong>: Location (on HDFS) to read the matrix of feature vectors; each row is an
+example.</p>
+
+<p><strong>Y</strong>: Location (on HDFS) to read the response matrix, which may have 1 or 2
+columns.</p>
+
+<p><strong>B</strong>: Location (on HDFS) to store the estimated regression parameters (the
+$\beta_j$'s), with the intercept parameter $\beta_0$ at position
+B[$m\,{+}\,1$, 1] if available</p>
+
+<p><strong>S</strong>: (default: <code>" "</code>) Location (on HDFS) to store the selected feature-ids in the
+order as computed by the algorithm; by default it is standard output.</p>
+
+<p><strong>O</strong>: (default: <code>" "</code>) Location (on HDFS) to write certain summary statistics
+described in <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a>; by default it is standard
+output.</p>
+
+<p><strong>link</strong>: (default: <code>2</code>) Link function code to determine the link
+function $\eta = g(\mu)$, see <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>; currently the
+following link functions are supported:</p>
+
+<ul>
+  <li>1 = log</li>
+  <li>2 = logit</li>
+  <li>3 = probit</li>
+  <li>4 = cloglog</li>
+</ul>
+
+<p><strong>yneg</strong>: (default: <code>0.0</code>) Response value for Bernoulli “No” label, usually 0.0 or -1.0</p>
+
+<p><strong>icpt</strong>: (default: <code>0</code>) Intercept and shifting/rescaling of the features in $X$:</p>
+
+<ul>
+  <li>0 = no intercept (hence no $\beta_0$), no
+shifting/rescaling of the features</li>
+  <li>1 = add intercept, but do not shift/rescale the features
+in $X$</li>
+  <li>2 = add intercept, shift/rescale the features in $X$ to
+mean 0, variance 1</li>
+</ul>
+
+<p><strong>tol</strong>: (default: <code>0.000001</code>) Tolerance ($\epsilon$) used in the convergence criterion: we
+terminate the outer iterations when the deviance changes by less than
+this factor; see below for details.</p>
+
+<p><strong>disp</strong>: (default: <code>0.0</code>) Dispersion parameter, or <code>0.0</code> to estimate it from
+data</p>
+
+<p><strong>moi</strong>: (default: <code>200</code>) Maximum number of outer (Fisher scoring) iterations</p>
+
+<p><strong>mii</strong>: (default: <code>0</code>) Maximum number of inner (conjugate gradient)
+iterations, or 0 if no maximum limit is provided</p>
+
+<p><strong>thr</strong>: (default: <code>0.01</code>) Threshold to stop the algorithm: if the decrease in the value
+of the AIC falls below <code>thr</code>, no further features are
+checked and the algorithm stops.</p>
+
+<p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>,
+<code>mm</code>, or <code>csv</code>; see read/write functions in
+SystemML Language Reference for details.</p>
+
+<h3 id="examples-3">Examples</h3>
+
+<div class="codetabs">
+<div data-lang="Hadoop">
+  <pre><code>hadoop jar SystemML.jar -f StepGLM.dml
+                        -nvargs X=/user/ml/X.mtx
+                                Y=/user/ml/Y.mtx
+                                B=/user/ml/B.mtx
+                                S=/user/ml/selected.csv
+                                O=/user/ml/stats.csv
+                                link=2
+                                yneg=-1.0
+                                icpt=2
+                                tol=0.000001
+                                moi=100
+                                mii=10
+                                thr=0.05
+                                fmt=csv
+</code></pre>
+  </div>
+<div data-lang="Spark">
+  <pre><code>$SPARK_HOME/bin/spark-submit --master yarn
+                             --deploy-mode cluster
+                             --conf spark.driver.maxResultSize=0
+                             SystemML.jar
+                             -f StepGLM.dml
+                             -config SystemML-config.xml
+                             -exec hybrid_spark
+                             -nvargs X=/user/ml/X.mtx
+                                     Y=/user/ml/Y.mtx
+                                     B=/user/ml/B.mtx
+                                     S=/user/ml/selected.csv
+                                     O=/user/ml/stats.csv
+                                     link=2
+                                     yneg=-1.0
+                                     icpt=2
+                                     tol=0.000001
+                                     moi=100
+                                     mii=10
+                                     thr=0.05
+                                     fmt=csv
+</code></pre>
+  </div>
+</div>
+
+<h3 id="details-3">Details</h3>
+
+<p>Similar to <code>StepLinearRegDS.dml</code>, our stepwise GLM script
+builds a model by iteratively selecting predictive variables using a
+forward selection strategy based on the AIC (5). Note that
+currently only the Bernoulli distribution family (<code>dfam=2</code> in
+<a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>) together with the following link functions
+are supported: <code>log</code>, <code>logit</code>, <code>probit</code>, and <code>cloglog</code> (link
+<script type="math/tex">\in\{1,2,3,4\}</script> in <a href="algorithms-regression.html#table11"><strong>Table 11</strong></a>).</p>
+
+<h3 id="returns-3">Returns</h3>
+
+<p>Similar to the outputs from <code>GLM.dml</code>, the stepwise GLM script
+computes the estimated regression coefficients and stores them in matrix
+$B$ on HDFS; matrix $B$ follows the same format as the one produced by
+<code>GLM.dml</code> (see <a href="algorithms-regression.html#generalized-linear-models">Generalized Linear Models</a>). Additionally,
+<code>StepGLM.dml</code> outputs the variable indices (stored in the
+1-column matrix $S$) in the order they have been selected by the
+algorithm, i.e., the $i$th entry in matrix $S$ stores the variable which
+improves the AIC the most in the $i$th iteration. If the model with the
+lowest AIC includes no variables, matrix $S$ will be empty. Moreover, the
+estimated summary statistics as defined in <a href="algorithms-regression.html#table9"><strong>Table 9</strong></a> are
+printed out or stored in a file on HDFS (if requested); these statistics
+will be provided only if the selected model is nonempty, i.e., contains
+at least one variable.</p>
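+<p>The forward-selection loop described in the Details section above can be summarized by the following
+schematic numpy sketch. It is an illustration only, not <code>StepGLM.dml</code>: it fits Bernoulli/logit
+models with a plain Newton loop, always includes an intercept, and mirrors the <code>thr</code> stopping rule:</p>
+
+<pre><code># Schematic forward selection by AIC (AIC = 2k - 2 * log-likelihood).
+import numpy as np
+
+def fit_logistic(X, y, iters=25):
+    """Return (beta, log_likelihood) for an intercept-plus-X logistic model."""
+    Z = np.column_stack([np.ones(len(y)), X])
+    beta = np.zeros(Z.shape[1])
+    for _ in range(iters):                       # Newton / IRLS updates
+        mu = 1.0 / (1.0 + np.exp(-(Z @ beta)))
+        w = mu * (1.0 - mu) + 1e-10
+        beta += np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (y - mu))
+    mu = np.clip(1.0 / (1.0 + np.exp(-(Z @ beta))), 1e-12, 1 - 1e-12)
+    return beta, np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))
+
+def step_glm(X, y, thr=0.01):
+    """Greedily add the feature that lowers the AIC most; stop when gain < thr."""
+    selected, remaining = [], list(range(X.shape[1]))
+    best_aic = 2 * 1 - 2 * fit_logistic(X[:, []], y)[1]   # intercept-only model
+    while remaining:
+        aics = {j: 2 * (len(selected) + 2) - 2 * fit_logistic(X[:, selected + [j]], y)[1]
+                for j in remaining}
+        j_best = min(aics, key=aics.get)
+        if best_aic - aics[j_best] < thr:        # AIC decrease below threshold
+            break
+        best_aic = aics[j_best]
+        selected.append(j_best)                  # analogue of the matrix S ordering
+        remaining.remove(j_best)
+    return selected, best_aic
+
+rng = np.random.default_rng(1)
+X = rng.normal(size=(200, 5))
+y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - X[:, 2])))).astype(float)
+print(step_glm(X, y, thr=0.05))   # likely to select columns 0 and 2 first
+</code></pre>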
+<hr />
+
+<h2 id="regression-scoring-and-prediction">4.5. Regression Scoring and Prediction</h2>
+
+<h3 id="description-4">Description</h3>
+
+<p>Script <code>GLM-predict.dml</code> is intended to cover all linear
+model based regressions, including linear regression, binomial and
+multinomial logistic regression, and GLM regressions (Poisson, gamma,
+binomial with probit link, etc.). Having just one scoring script for all
+these regressions simplifies maintenance and enhancement while ensuring
+compatible interpretations for output statistics.</p>
+
+<p>The script performs two functions: prediction and scoring. To perform
+prediction, the script takes two matrix inputs: a collection of records
+$X$ (without the response attribute) and the estimated regression
+parameters $B$, also known as $\beta$. To perform scoring, in addition
+to $X$ and $B$, the script takes the matrix of actual response
+values $Y$ that are compared to the predictions made with $X$ and $B$.
+Of course there are other, non-matrix, input arguments that specify the
+model and the output format; see below for the full list.</p>
+
+<p>We assume that our test/scoring dataset is given by an
+$n\,{\times}\,m$ matrix $X$ of numerical feature vectors, where each
+row $x_i$ represents one feature vector of one record; we have $\dim x_i = m$. Each
+record also includes the response variable $y_i$ that may be numerical,
+single-label categorical, or multi-label categorical. A single-label
+categorical $y_i$ is an integer category label, one label per record; a
+multi-label $y_i$ is a vector of integer counts, one count for each
+possible label, which represents multiple single-label events
+(observations) for the same $x_i$. Internally we convert single-label
[... 939 lines stripped ...]
