http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4e9699e4/docs/Algorithms Reference/DescriptiveBivarStats.tex ---------------------------------------------------------------------- diff --git a/docs/Algorithms Reference/DescriptiveBivarStats.tex b/docs/Algorithms Reference/DescriptiveBivarStats.tex deleted file mode 100644 index a2d3db1..0000000 --- a/docs/Algorithms Reference/DescriptiveBivarStats.tex +++ /dev/null @@ -1,438 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Bivariate Statistics} - -\noindent{\bf Description} -\smallskip - -Bivariate statistics are used to quantitatively describe the association between -two features, such as test their statistical (in-)dependence or measure -the accuracy of one data feature predicting the other feature, in a sample. -The \BivarScriptName{} script computes common bivariate statistics, -such as \NameStatR{} and \NameStatChi{}, in parallel for many pairs -of data features. For a given dataset matrix, script \BivarScriptName{} computes -certain bivariate statistics for the given feature (column) pairs in the -matrix. The feature types govern the exact set of statistics computed for that pair. -For example, \NameStatR{} can only be computed on two quantitative (scale) -features like `Height' and `Temperature'. -It does not make sense to compute the linear correlation of two categorical attributes -like `Hair Color'. - - -\smallskip -\noindent{\bf Usage} -\smallskip - -{\hangindent=\parindent\noindent\it%\tolerance=0 -{\tt{}-f }path/\/\BivarScriptName{} -{\tt{} -nvargs} -{\tt{} X=}path/file -{\tt{} index1=}path/file -{\tt{} index2=}path/file -{\tt{} types1=}path/file -{\tt{} types2=}path/file -{\tt{} OUTDIR=}path -% {\tt{} fmt=}format - -} - - -\smallskip -\noindent{\bf Arguments} -\begin{Description} -\item[{\tt X}:] -Location (on HDFS) to read the data matrix $X$ whose columns are the features -that we want to compare and correlate with bivariate statistics. -\item[{\tt index1}:] % (default:\mbox{ }{\tt " "}) -Location (on HDFS) to read the single-row matrix that lists the column indices -of the \emph{first-argument} features in pairwise statistics. -Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the -index $k$ of column \texttt{X[,$\,k$]} in the data matrix -whose bivariate statistics need to be computed. -% The default value means ``use all $X$-columns from the first to the last.'' -\item[{\tt index2}:] % (default:\mbox{ }{\tt " "}) -Location (on HDFS) to read the single-row matrix that lists the column indices -of the \emph{second-argument} features in pairwise statistics. 
-Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the
-index $l$ of column \texttt{X[,$\,l$]} in the data matrix
-whose bivariate statistics need to be computed.
-% The default value means ``use all $X$-columns from the first to the last.''
-\item[{\tt types1}:] % (default:\mbox{ }{\tt " "})
-Location (on HDFS) to read the single-row matrix that lists the \emph{types}
-of the \emph{first-argument} features in pairwise statistics.
-Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the type
-of column \texttt{X[,$\,k$]} in the data matrix, where $k$ is the $i^{\textrm{th}}$
-entry in the {\tt index1} matrix. Feature types must be encoded by
-integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
-% The default value means ``treat all referenced $X$-columns as scale.''
-\item[{\tt types2}:] % (default:\mbox{ }{\tt " "})
-Location (on HDFS) to read the single-row matrix that lists the \emph{types}
-of the \emph{second-argument} features in pairwise statistics.
-Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the type
-of column \texttt{X[,$\,l$]} in the data matrix, where $l$ is the $j^{\textrm{th}}$
-entry in the {\tt index2} matrix. Feature types must be encoded by
-integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
-% The default value means ``treat all referenced $X$-columns as scale.''
-\item[{\tt OUTDIR}:]
-Location path (on HDFS) where the output matrices with computed bivariate
-statistics will be stored. The matrices' file names and format are defined
-in Table~\ref{table:bivars}.
-% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-% see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-\begin{table}[t]\hfil
-\begin{tabular}{|lll|}
-\hline\rule{0pt}{12pt}%
-Output File / Matrix & Row$\,$\# & Name of Statistic \\[2pt]
-\hline\hline\rule{0pt}{12pt}%
-\emph{All Files} & 1 & 1-st feature column \\
-\rule{1em}{0pt}" & 2 & 2-nd feature column \\[2pt]
-\hline\rule{0pt}{12pt}%
-bivar.scale.scale.stats & 3 & \NameStatR \\[2pt]
-\hline\rule{0pt}{12pt}%
-bivar.nominal.nominal.stats & 3 & \NameStatChi \\
-\rule{1em}{0pt}" & 4 & Degrees of freedom \\
-\rule{1em}{0pt}" & 5 & \NameStatPChi \\
-\rule{1em}{0pt}" & 6 & \NameStatV \\[2pt]
-\hline\rule{0pt}{12pt}%
-bivar.nominal.scale.stats & 3 & \NameStatEta \\
-\rule{1em}{0pt}" & 4 & \NameStatF \\[2pt]
-\hline\rule{0pt}{12pt}%
-bivar.ordinal.ordinal.stats & 3 & \NameStatRho \\[2pt]
-\hline
-\end{tabular}\hfil
-\caption{%
-The output matrices of \BivarScriptName{} have one row per bivariate
-statistic and one column per pair of input features. This table lists
-the meaning of each matrix and each row.%
-% Signs ``+'' show applicability to scale or/and to categorical features.
-}
-\label{table:bivars}
-\end{table}
-
-
-
-\pagebreak[2]
-
-\noindent{\bf Details}
-\smallskip
-
-Script \BivarScriptName{} takes an input matrix \texttt{X} whose columns represent
-the features and whose rows represent the records of a data sample.
-Given \texttt{X}, the script computes certain relevant bivariate statistics
-for specified pairs of feature columns \texttt{X[,$\,i$]} and \texttt{X[,$\,j$]}.
-Command-line parameters \texttt{index1} and \texttt{index2} specify the files with
-column pairs of interest to the user.
Namely, the file given by \texttt{index1}
-contains the vector of the 1st-attribute column indices and the file given
-by \texttt{index2} has the vector of the 2nd-attribute column indices, with
-``1st'' and ``2nd'' referring to their places in bivariate statistics.
-Note that both \texttt{index1} and \texttt{index2} files should contain a 1-row matrix
-of positive integers.
-
-The bivariate statistics to be computed depend on the \emph{types}, or
-\emph{measurement levels}, of the two columns.
-The types for each pair are provided in the files whose locations are specified by
-\texttt{types1} and \texttt{types2} command-line parameters.
-These files are also 1-row matrices, i.e.\ vectors, that list the 1st-attribute and
-the 2nd-attribute column types in the same order as their indices in the
-\texttt{index1} and \texttt{index2} files. The types must be provided as per
-the following convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
-
-The script organizes its results into (potentially) four output matrices, one for
-each type combination. The types of bivariate statistics are defined using the types
-of the columns that were used for their arguments, with ``ordinal'' sometimes
-retrogressing to ``nominal.'' Table~\ref{table:bivars} describes what each row
-in each output matrix contains. In particular, the script includes the following
-statistics:
-\begin{Itemize}
-\item For a pair of scale (quantitative) columns, \NameStatR;
-\item For a pair of nominal columns (with finite-sized, fixed, unordered domains),
-the \NameStatChi{} and its p-value;
-\item For a pair of one scale column and one nominal column, the \NameStatEta{} and the \NameStatF{};
-\item For a pair of ordinal columns (ordered domains depicting ranks), \NameStatRho.
-\end{Itemize}
-Note that, as shown in Table~\ref{table:bivars}, the output matrices contain the
-column indices of the features involved in each statistic.
-Moreover, if the output matrix does not contain
-a value in a certain cell then it should be interpreted as a~$0$
-(sparse matrix representation).
-
-Below we list all bivariate statistics computed by script \BivarScriptName.
-The statistics are collected into several groups by the type of their input
-features. We refer to the two input features as $v_1$ and $v_2$ unless
-specified otherwise; the value pairs are $(v_{1,i}, v_{2,i})$ for $i=1,\ldots,n$,
-where $n$ is the number of rows in \texttt{X}, i.e.\ the sample size.
-
-
-\paragraph{Scale-vs-scale statistics.}
-Sample statistics that describe association between two quantitative (scale) features.
-A scale feature has numerical values, with the natural ordering relation.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it\NameStatR]:
-A measure of linear dependence between two numerical features:
-\begin{equation*}
-r \,\,=\,\, \frac{\Cov(v_1, v_2)}{\sqrt{\Var v_1 \Var v_2}}
-\,\,=\,\, \frac{\sum_{i=1}^n (v_{1,i} - \bar{v}_1) (v_{2,i} - \bar{v}_2)}%
-{\sqrt{\sum_{i=1}^n (v_{1,i} - \bar{v}_1)^{2\mathstrut} \cdot \sum_{i=1}^n (v_{2,i} - \bar{v}_2)^{2\mathstrut}}}
-\end{equation*}
-Commonly denoted by~$r$, correlation ranges between $-1$ and $+1$, reaching ${\pm}1$ when all value
-pairs $(v_{1,i}, v_{2,i})$ lie on the same line. Correlation near~0 means that a line is not a good
-way to represent the dependence between the two features; however, this does not imply independence.
-The sign indicates direction of the linear association: $r > 0$ ($r < 0$) if one feature tends to -linearly increase (decrease) when the other feature increases. Nonlinear association, if present, -may disobey this sign. -\NameStatR{} is symmetric: $r(v_1, v_2) = r(v_2, v_1)$; it does not change if we transform $v_1$ and $v_2$ -to $a + b v_1$ and $c + d v_2$ where $a, b, c, d$ are constants and $b, d > 0$. - -Suppose that we use simple linear regression to represent one feature given the other, say -represent $v_{2,i} \approx \alpha + \beta v_{1,i}$ by selecting $\alpha$ and $\beta$ -to minimize the least-squares error $\sum_{i=1}^n (v_{2,i} - \alpha - \beta v_{1,i})^2$. -Then the best error equals -\begin{equation*} -\min_{\alpha, \beta} \,\,\sum_{i=1}^n \big(v_{2,i} - \alpha - \beta v_{1,i}\big)^2 \,\,=\,\, -(1 - r^2) \,\sum_{i=1}^n \big(v_{2,i} - \bar{v}_2\big)^2 -\end{equation*} -In other words, $1\,{-}\,r^2$ is the ratio of the residual sum of squares to -the total sum of squares. Hence, $r^2$ is an accuracy measure of the linear regression. -\end{Description} - - -\paragraph{Nominal-vs-nominal statistics.} -Sample statistics that describe association between two nominal categorical features. -Both features' value domains are encoded with positive integers in arbitrary order: -nominal features do not order their value domains. -\begin{Description} -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it\NameStatChi]: -A measure of how much the frequencies of value pairs of two categorical features deviate from -statistical independence. Under independence, the probability of every value pair must equal -the product of probabilities of each value in the pair: -$\Prob[a, b] - \Prob[a]\,\Prob[b] = 0$. But we do not know these (hypothesized) probabilities; -we only know the sample frequency counts. Let $n_{a,b}$ be the frequency count of pair -$(a, b)$, let $n_a$ and $n_b$ be the frequency counts of $a$~alone and of $b$~alone. Under -independence, difference $n_{a,b}{/}n - (n_a{/}n)(n_b{/}n)$ is unlikely to be exactly~0 due -to sample randomness, yet it is unlikely to be too far from~0. For some pairs $(a,b)$ it may -deviate from~0 farther than for other pairs. \NameStatChi{}~is an aggregate measure that -combines squares of these differences across all value pairs: -\begin{equation*} -\chi^2 \,\,=\,\, \sum_{a,\,b} \Big(\frac{n_a n_b}{n}\Big)^{-1} \Big(n_{a,b} - \frac{n_a n_b}{n}\Big)^2 -\,=\,\, \sum_{a,\,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}} -\end{equation*} -where $O_{a,b} = n_{a,b}$ are the \emph{observed} frequencies and $E_{a,b} = (n_a n_b){/}n$ are -the \emph{expected} frequencies for all pairs~$(a,b)$. Under independence (plus other standard -assumptions) the sample~$\chi^2$ closely follows a well-known distribution, making it a basis for -statistical tests for independence, see~\emph{\NameStatPChi} for details. Note that \NameStatChi{} -does \emph{not} measure the strength of dependence: even very weak dependence may result in a -significant deviation from independence if the counts are large enough. Use~\NameStatV{} instead -to measure the strength of dependence. -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it Degrees of freedom]: -An integer parameter required for the interpretation of~\NameStatChi{} measure. 
Under independence -(plus other standard assumptions) the sample~$\chi^2$ statistic is approximately distributed as the -sum of $d$~squares of independent normal random variables with mean~0 and variance~1, where $d$ is -this integer parameter. For a pair of categorical features such that the $1^{\textrm{st}}$~feature -has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories, the number of degrees -of freedom is $d = (k_1 - 1)(k_2 - 1)$. -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it\NameStatPChi]: -A measure of how likely we would observe the current frequencies of value pairs of two categorical -features assuming their statistical independence. More precisely, it computes the probability that -the sum of $d$~squares of independent normal random variables with mean~0 and variance~1 -(called the $\chi^2$~distribution with $d$ degrees of freedom) generates a value at least as large -as the current sample \NameStatChi. The $d$ parameter is \emph{degrees of freedom}, see above. -Under independence (plus other standard assumptions) the sample \NameStatChi{} closely follows the -$\chi^2$~distribution and is unlikely to land very far into its tail. On the other hand, if the -two features are dependent, their sample \NameStatChi{} becomes arbitrarily large as $n\to\infty$ -and lands extremely far into the tail of the $\chi^2$~distribution given a large enough data sample. -\NameStatPChi{} returns the tail ``weight'' on the right-hand side of \NameStatChi: -\begin{equation*} -P\,\,=\,\, \Prob\big[r \geq \textrm{\NameStatChi} \,\,\big|\,\, r \sim \textrm{the $\chi^2$ distribution}\big] -\end{equation*} -As any probability, $P$ ranges between 0 and~1. If $P\leq 0.05$, the dependence between the two -features may be considered statistically significant (i.e.\ their independence is considered -statistically ruled out). For highly dependent features, it is not unusual to have $P\leq 10^{-20}$ -or less, in which case our script will simply return $P = 0$. Independent features should have -their $P\geq 0.05$ in about 95\% cases. -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it\NameStatV]: -A measure for the strength of association, i.e.\ of statistical dependence, between two categorical -features, conceptually similar to \NameStatR. It divides the observed~\NameStatChi{} by the maximum -possible~$\chi^2_{\textrm{max}}$ given $n$ and the number $k_1, k_2$~of categories in each feature, -then takes the square root. Thus, \NameStatV{} ranges from 0 to~1, -where 0 implies no association and 1 implies the maximum possible association (one-to-one -correspondence) between the two features. See \emph{\NameStatChi} for the computation of~$\chi^2$; -its maximum${} = {}$% -$n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}$ where the $1^{\textrm{st}}$~feature -has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories~\cite{AcockStavig1979:CramersV}, -so -\begin{equation*} -\textrm{\NameStatV} \,\,=\,\, \sqrt{\frac{\textrm{\NameStatChi}}{n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}}} -\end{equation*} -As opposed to \NameStatPChi, which goes to~0 (rapidly) as the features' dependence increases, -\NameStatV{} goes towards~1 (slowly) as the dependence increases. Both \NameStatChi{} and -\NameStatPChi{} are very sensitive to~$n$, but in \NameStatV{} this is mitigated by taking the -ratio. 
-\end{Description} - - -\paragraph{Nominal-vs-scale statistics.} -Sample statistics that describe association between a categorical feature -(order ignored) and a quantitative (scale) feature. -The values of the categorical feature must be coded as positive integers. -\begin{Description} -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it\NameStatEta]: -A measure for the strength of association (statistical dependence) between a nominal feature -and a scale feature, conceptually similar to \NameStatR. Ranges from 0 to~1, approaching 0 -when there is no association and approaching 1 when there is a strong association. -The nominal feature, treated as the independent variable, is assumed to have relatively few -possible values, all with large frequency counts. The scale feature is treated as the dependent -variable. Denoting the nominal feature by~$x$ and the scale feature by~$y$, we have: -\begin{equation*} -\eta^2 \,=\, 1 - \frac{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, -\,\,\,\,\textrm{where}\,\,\,\, -\hat{y}[x] = \frac{1}{\mathop{\mathrm{freq}}(x)}\sum_{i=1}^n -\,\left\{\!\!\begin{array}{rl} y_i & \textrm{if $x_i = x$}\\ 0 & \textrm{otherwise}\end{array}\right.\!\!\! -\end{equation*} -and $\bar{y} = (1{/}n)\sum_{i=1}^n y_i$ is the mean. Value $\hat{y}[x]$ is the average -of~$y_i$ among all records where $x_i = x$; it can also be viewed as the ``predictor'' -of $y$ given~$x$. Then $\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2$ is the residual error -sum-of-squares and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum-of-squares for~$y$. -Hence, $\eta^2$ measures the accuracy of predicting $y$ with~$x$, just like the -``R-squared'' statistic measures the accuracy of linear regression. Our output $\eta$ -is the square root of~$\eta^2$. -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it\NameStatF]: -A measure of how much the values of the scale feature, denoted here by~$y$, -deviate from statistical independence on the nominal feature, denoted by~$x$. -The same measure appears in the one-way analysis of vari\-ance (ANOVA). -Like \NameStatChi, \NameStatF{} is used to test the hypothesis that -$y$~is independent from~$x$, given the following assumptions: -\begin{Itemize} -\item The scale feature $y$ has approximately normal distribution whose mean -may depend only on~$x$ and variance is the same for all~$x$; -\item The nominal feature $x$ has relatively small value domain with large -frequency counts, the $x_i$-values are treated as fixed (non-random); -\item All records are sampled independently of each other. -\end{Itemize} -To compute \NameStatF{}, we first compute $\hat{y}[x]$ as the average of~$y_i$ -among all records where $x_i = x$. These $\hat{y}[x]$ can be viewed as -``predictors'' of $y$ given~$x$; if $y$ is independent on~$x$, they should -``predict'' only the global mean~$\bar{y}$. Then we form two sums-of-squares: -\begin{Itemize} -\item \emph{Residual} sum-of-squares of the ``predictor'' accuracy: $y_i - \hat{y}[x_i]$; -\item \emph{Explained} sum-of-squares of the ``predictor'' variability: $\hat{y}[x_i] - \bar{y}$. 
-\end{Itemize} -\NameStatF{} is the ratio of the explained sum-of-squares to -the residual sum-of-squares, each divided by their corresponding degrees -of freedom: -\begin{equation*} -F \,\,=\,\, -\frac{\sum_{x}\, \mathop{\mathrm{freq}}(x) \, \big(\hat{y}[x] - \bar{y}\big)^2 \,\big/\,\, (k\,{-}\,1)}% -{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2 \,\big/\,\, (n\,{-}\,k)} \,\,=\,\, -\frac{n\,{-}\,k}{k\,{-}\,1} \cdot \frac{\eta^2}{1 - \eta^2} -\end{equation*} -Here $k$ is the domain size of the nominal feature~$x$. The $k$ ``predictors'' -lose 1~freedom due to their linear dependence with~$\bar{y}$; similarly, -the $n$~$y_i$-s lose $k$~freedoms due to the ``predictors''. - -The statistic can test if the independence hypothesis of $y$ from $x$ is reasonable; -more generally (with relaxed normality assumptions) it can test the hypothesis that -\emph{the mean} of $y$ among records with a given~$x$ is the same for all~$x$. -Under this hypothesis \NameStatF{} has, or approximates, the $F(k\,{-}\,1, n\,{-}\,k)$-distribution. -But if the mean of $y$ given $x$ depends on~$x$, \NameStatF{} -becomes arbitrarily large as $n\to\infty$ (with $k$~fixed) and lands extremely far -into the tail of the $F(k\,{-}\,1, n\,{-}\,k)$-distribution given a large enough data sample. -\end{Description} - - -\paragraph{Ordinal-vs-ordinal statistics.} -Sample statistics that describe association between two ordinal categorical features. -Both features' value domains are encoded with positive integers, so that the natural -order of the integers coincides with the order in each value domain. -\begin{Description} -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it\NameStatRho]: -A measure for the strength of association (statistical dependence) between -two ordinal features, conceptually similar to \NameStatR. Specifically, it is \NameStatR{} -applied to the feature vectors in which all values are replaced by their ranks, i.e.\ -their positions if the vector is sorted. The ranks of identical (duplicate) values -are replaced with their average rank. For example, in vector -$(15, 11, 26, 15, 8)$ the value ``15'' occurs twice with ranks 3 and~4 per the sorted -order $(8_1, 11_2, 15_3, 15_4, 26_5)$; so, both values are assigned their average -rank of $3.5 = (3\,{+}\,4)\,{/}\,2$ and the vector is replaced by~$(3.5,\, 2,\, 5,\, 3.5,\, 1)$. - -Our implementation of \NameStatRho{} is geared towards features having small value domains -and large counts for the values. Given the two input vectors, we form a contingency table $T$ -of pairwise frequency counts, as well as a vector of frequency counts for each feature: $f_1$ -and~$f_2$. Here in $T_{i,j}$, $f_{1,i}$, $f_{2,j}$ indices $i$ and~$j$ refer to the -order-preserving integer encoding of the feature values. -We use prefix sums over $f_1$ and~$f_2$ to compute the values' average ranks: -$r_{1,i} = \sum_{j=1}^{i-1} f_{1,j} + (f_{1,i}\,{+}\,1){/}2$, and analogously for~$r_2$. -Finally, we compute rank variances for $r_1, r_2$ weighted by counts $f_1, f_2$ and their -covariance weighted by~$T$, before applying the standard formula for \NameStatR: -\begin{equation*} -\rho \,\,=\,\, \frac{\Cov_T(r_1, r_2)}{\sqrt{\Var_{f_1}(r_1)\Var_{f_2}(r_2)}} -\,\,=\,\, \frac{\sum_{i,j} T_{i,j} (r_{1,i} - \bar{r}_1) (r_{2,j} - \bar{r}_2)}% -{\sqrt{\sum_i f_{1,i} (r_{1,i} - \bar{r}_1)^{2\mathstrut} \cdot \sum_j f_{2,j} (r_{2,j} - \bar{r}_2)^{2\mathstrut}}} -\end{equation*} -where $\bar{r}_1 = \sum_i r_{1,i} f_{1,i}{/}n$, analogously for~$\bar{r}_2$. 
-The value of $\rho$ lies between $-1$ and $+1$, with sign indicating the prevalent direction -of the association: $\rho > 0$ ($\rho < 0$) means that one feature tends to increase (decrease) -when the other feature increases. The correlation becomes~1 when the two features are -monotonically related. -\end{Description} - - -\smallskip -\noindent{\bf Returns} -\smallskip - -A collection of (potentially) 4 matrices. Each matrix contains bivariate statistics that -resulted from a different combination of feature types. There is one matrix for scale-scale -statistics (which includes \NameStatR), one for nominal-nominal statistics (includes \NameStatChi{}), -one for nominal-scale statistics (includes \NameStatF) and one for ordinal-ordinal statistics -(includes \NameStatRho). If any of these matrices is not produced, then no pair of columns required -the corresponding type combination. See Table~\ref{table:bivars} for the matrix naming and -format details. - - -\smallskip -\pagebreak[2] - -\noindent{\bf Examples} -\smallskip - -{\hangindent=\parindent\noindent\tt -\hml -f \BivarScriptName{} -nvargs -X=/user/biadmin/X.mtx -index1=/user/biadmin/S1.mtx -index2=/user/biadmin/S2.mtx -types1=/user/biadmin/K1.mtx -types2=/user/biadmin/K2.mtx -OUTDIR=/user/biadmin/stats.mtx - -} -
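\smallskip
For readers who want to sanity-check the formulas in this section outside of SystemML,
the sketch below recomputes each group of statistics on in-memory column pairs with
NumPy/SciPy. It only illustrates the definitions above; it is not the \BivarScriptName{}
implementation, and the helper names and the use of {\tt scipy.stats} are our own choices.

\begin{verbatim}
import numpy as np
from scipy import stats

def pearson_r(v1, v2):
    # r = Cov(v1, v2) / sqrt(Var(v1) * Var(v2))
    d1, d2 = v1 - v1.mean(), v2 - v2.mean()
    return (d1 * d2).sum() / np.sqrt((d1 * d1).sum() * (d2 * d2).sum())

def chi2_dof_pval_cramers_v(a, b):
    # Observed counts O and expected counts E = (row sums x col sums) / n
    ca, ia = np.unique(a, return_inverse=True)
    cb, ib = np.unique(b, return_inverse=True)
    O = np.zeros((ca.size, cb.size))
    np.add.at(O, (ia, ib), 1.0)
    n = O.sum()
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n
    chi2 = ((O - E) ** 2 / E).sum()                    # Pearson's chi^2
    dof = (ca.size - 1) * (cb.size - 1)                # degrees of freedom
    pval = stats.chi2.sf(chi2, dof)                    # P-value of Pearson's chi^2
    v = np.sqrt(chi2 / (n * min(ca.size - 1, cb.size - 1)))  # Cramer's V
    return chi2, dof, pval, v

def eta_and_f(x_nominal, y_scale):
    # Eta statistic and F statistic for a nominal-vs-scale pair
    groups = [y_scale[x_nominal == c] for c in np.unique(x_nominal)]
    rss = sum(((g - g.mean()) ** 2).sum() for g in groups)  # residual sum-of-squares
    tss = ((y_scale - y_scale.mean()) ** 2).sum()           # total sum-of-squares
    k, n = len(groups), y_scale.size
    eta = np.sqrt(1.0 - rss / tss)
    F = ((tss - rss) / (k - 1)) / (rss / (n - k))
    return eta, F

def spearman_rho(v1, v2):
    # Spearman's rho = Pearson's r on average ranks (ties share their mean rank)
    return pearson_r(stats.rankdata(v1), stats.rankdata(v2))
\end{verbatim}

Each helper should agree, up to numerical rounding, with the corresponding rows of the
output matrices in Table~\ref{table:bivars} for the given pair of columns.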
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4e9699e4/docs/Algorithms Reference/DescriptiveStats.tex ---------------------------------------------------------------------- diff --git a/docs/Algorithms Reference/DescriptiveStats.tex b/docs/Algorithms Reference/DescriptiveStats.tex deleted file mode 100644 index 5a59ad4..0000000 --- a/docs/Algorithms Reference/DescriptiveStats.tex +++ /dev/null @@ -1,115 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\newcommand{\UnivarScriptName}{\texttt{\tt Univar-Stats.dml}} -\newcommand{\BivarScriptName}{\texttt{\tt bivar-stats.dml}} - -\newcommand{\OutputRowIDMinimum}{1} -\newcommand{\OutputRowIDMaximum}{2} -\newcommand{\OutputRowIDRange}{3} -\newcommand{\OutputRowIDMean}{4} -\newcommand{\OutputRowIDVariance}{5} -\newcommand{\OutputRowIDStDeviation}{6} -\newcommand{\OutputRowIDStErrorMean}{7} -\newcommand{\OutputRowIDCoeffVar}{8} -\newcommand{\OutputRowIDQuartiles}{?, 13, ?} -\newcommand{\OutputRowIDMedian}{13} -\newcommand{\OutputRowIDIQMean}{14} -\newcommand{\OutputRowIDSkewness}{9} -\newcommand{\OutputRowIDKurtosis}{10} -\newcommand{\OutputRowIDStErrorSkewness}{11} -\newcommand{\OutputRowIDStErrorCurtosis}{12} -\newcommand{\OutputRowIDNumCategories}{15} -\newcommand{\OutputRowIDMode}{16} -\newcommand{\OutputRowIDNumModes}{17} -\newcommand{\OutputRowText}[1]{\mbox{(output row~{#1})\hspace{0.5pt}:}} - -\newcommand{\NameStatR}{Pearson's correlation coefficient} -\newcommand{\NameStatChi}{Pearson's~$\chi^2$} -\newcommand{\NameStatPChi}{$P\textrm{-}$value of Pearson's~$\chi^2$} -\newcommand{\NameStatV}{Cram\'er's~$V$} -\newcommand{\NameStatEta}{Eta statistic} -\newcommand{\NameStatF}{$F$~statistic} -\newcommand{\NameStatRho}{Spearman's rank correlation coefficient} - -Descriptive statistics are used to quantitatively describe the main characteristics of the data. -They provide meaningful summaries computed over different observations or data records -collected in a study. These summaries typically form the basis of the initial data exploration -as part of a more extensive statistical analysis. Such a quantitative analysis assumes that -every variable (also known as, attribute, feature, or column) in the data has a specific -\emph{level of measurement}~\cite{Stevens1946:scales}. - -The measurement level of a variable, often called as {\bf variable type}, can either be -\emph{scale} or \emph{categorical}. A \emph{scale} variable represents the data measured on -an interval scale or ratio scale. Examples of scale variables include `Height', `Weight', -`Salary', and `Temperature'. Scale variables are also referred to as \emph{quantitative} -or \emph{continuous} variables. In contrast, a \emph{categorical} variable has a fixed -limited number of distinct values or categories. 
Examples of categorical variables -include `Gender', `Region', `Hair color', `Zipcode', and `Level of Satisfaction'. -Categorical variables can further be classified into two types, \emph{nominal} and -\emph{ordinal}, depending on whether the categories in the variable can be ordered via an -intrinsic ranking. For example, there is no meaningful ranking among distinct values in -`Hair color' variable, while the categories in `Level of Satisfaction' can be ranked from -highly dissatisfied to highly satisfied. - -The input dataset for descriptive statistics is provided in the form of a matrix, whose -rows are the records (data points) and whose columns are the features (i.e.~variables). -Some scripts allow this matrix to be vertically split into two or three matrices. Descriptive -statistics are computed over the specified features (columns) in the matrix. Which -statistics are computed depends on the types of the features. It is important to keep -in mind the following caveats and restrictions: -\begin{Enumerate} -\item Given a finite set of data records, i.e.~a \emph{sample}, we take their feature -values and compute their \emph{sample statistics}. These statistics -will vary from sample to sample even if the underlying distribution of feature values -remains the same. Sample statistics are accurate for the given sample only. -If the goal is to estimate the \emph{distribution statistics} that are parameters of -the (hypothesized) underlying distribution of the features, the corresponding sample -statistics may sometimes be used as approximations, but their accuracy will vary. -\item In particular, the accuracy of the estimated distribution statistics will be low -if the number of values in the sample is small. That is, for small samples, the computed -statistics may depend on the randomness of the individual sample values more than on -the underlying distribution of the features. -\item The accuracy will also be low if the sample records cannot be assumed mutually -independent and identically distributed (i.i.d.), that is, sampled at random from the -same underlying distribution. In practice, feature values in one record often depend -on other features and other records, including unknown ones. -\item Most of the computed statistics will have low estimation accuracy in the presence of -extreme values (outliers) or if the underlying distribution has heavy tails, for example -obeys a power law. However, a few of the computed statistics, such as the median and -\NameStatRho{}, are \emph{robust} to outliers. -\item Some sample statistics are reported with their \emph{sample standard errors} -in an attempt to quantify their accuracy as distribution parameter estimators. But these -sample standard errors, in turn, only estimate the underlying distribution's standard -errors and will have low accuracy for small or \mbox{non-i.i.d.} samples, outliers in samples, -or heavy-tailed distributions. -\item We assume that the quantitative (scale) feature columns do not contain missing -values, infinite values, \texttt{NaN}s, or coded non-numeric values, unless otherwise -specified. We assume that each categorical feature column contains positive integers -from 1 to the number of categories; for ordinal features, the natural order on -the integers should coincide with the order on the categories. 
-\end{Enumerate} - -\input{DescriptiveUnivarStats} - -\input{DescriptiveBivarStats} - -\input{DescriptiveStratStats} http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4e9699e4/docs/Algorithms Reference/DescriptiveStratStats.tex ---------------------------------------------------------------------- diff --git a/docs/Algorithms Reference/DescriptiveStratStats.tex b/docs/Algorithms Reference/DescriptiveStratStats.tex deleted file mode 100644 index be0cffd..0000000 --- a/docs/Algorithms Reference/DescriptiveStratStats.tex +++ /dev/null @@ -1,306 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Stratified Bivariate Statistics} - -\noindent{\bf Description} -\smallskip - -The {\tt stratstats.dml} script computes common bivariate statistics, such -as correlation, slope, and their p-value, in parallel for many pairs of input -variables in the presence of a confounding categorical variable. The values -of this confounding variable group the records into strata (subpopulations), -in which all bivariate pairs are assumed free of confounding. The script -uses the same data model as in one-way analysis of covariance (ANCOVA), with -strata representing population samples. It also outputs univariate stratified -and bivariate unstratified statistics. - -\begin{table}[t]\hfil -\begin{tabular}{|l|ll|ll|ll||ll|} -\hline -Month of the year & \multicolumn{2}{l|}{October} & \multicolumn{2}{l|}{November} & - \multicolumn{2}{l||}{December} & \multicolumn{2}{c|}{Oct$\,$--$\,$Dec} \\ -Customers, millions & 0.6 & 1.4 & 1.4 & 0.6 & 3.0 & 1.0 & 5.0 & 3.0 \\ -\hline -Promotion (0 or 1) & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ -Avg.\ sales per 1000 & 0.4 & 0.5 & 0.9 & 1.0 & 2.5 & 2.6 & 1.8 & 1.3 \\ -\hline -\end{tabular}\hfil -\caption{Stratification example: the effect of the promotion on average sales -becomes reversed and amplified (from $+0.1$ to $-0.5$) if we ignore the months.} -\label{table:stratexample} -\end{table} - -To see how data stratification mitigates confounding, consider an (artificial) -example in Table~\ref{table:stratexample}. A highly seasonal retail item -was marketed with and without a promotion over the final 3~months of the year. -In each month the sale was more likely with the promotion than without it. -But during the peak holiday season, when shoppers came in greater numbers and -bought the item more often, the promotion was less frequently used. As a result, -if the 4-th quarter data is pooled together, the promotion's effect becomes -reversed and magnified. Stratifying by month restores the positive correlation. - -The script computes its statistics in parallel over all possible pairs from two -specified sets of covariates. 
The 1-st covariate is a column in input matrix~$X$ -and the 2-nd covariate is a column in input matrix~$Y$; matrices $X$ and~$Y$ may -be the same or different. The columns of interest are given by their index numbers -in special matrices. The stratum column, specified in its own matrix, is the same -for all covariate pairs. - -Both covariates in each pair must be numerical, with the 2-nd covariate normally -distributed given the 1-st covariate (see~Details). Missing covariate values or -strata are represented by~``NaN''. Records with NaN's are selectively omitted -wherever their NaN's are material to the output statistic. - -\smallskip -\pagebreak[3] - -\noindent{\bf Usage} -\smallskip - -{\hangindent=\parindent\noindent\it% -{\tt{}-f }path/\/{\tt{}stratstats.dml} -{\tt{} -nvargs} -{\tt{} X=}path/file -{\tt{} Xcid=}path/file -{\tt{} Y=}path/file -{\tt{} Ycid=}path/file -{\tt{} S=}path/file -{\tt{} Scid=}int -{\tt{} O=}path/file -{\tt{} fmt=}format - -} - - -\smallskip -\noindent{\bf Arguments} -\begin{Description} -\item[{\tt X}:] -Location (on HDFS) to read matrix $X$ whose columns we want to use as -the 1-st covariate (i.e.~as the feature variable) -\item[{\tt Xcid}:] (default:\mbox{ }{\tt " "}) -Location to read the single-row matrix that lists all index numbers -of the $X$-columns used as the 1-st covariate; the default value means -``use all $X$-columns'' -\item[{\tt Y}:] (default:\mbox{ }{\tt " "}) -Location to read matrix $Y$ whose columns we want to use as the 2-nd -covariate (i.e.~as the response variable); the default value means -``use $X$ in place of~$Y$'' -\item[{\tt Ycid}:] (default:\mbox{ }{\tt " "}) -Location to read the single-row matrix that lists all index numbers -of the $Y$-columns used as the 2-nd covariate; the default value means -``use all $Y$-columns'' -\item[{\tt S}:] (default:\mbox{ }{\tt " "}) -Location to read matrix $S$ that has the stratum column. -Note: the stratum column must contain small positive integers; all fractional -values are rounded; stratum IDs of value ${\leq}\,0$ or NaN are treated as -missing. The default value for {\tt S} means ``use $X$ in place of~$S$'' -\item[{\tt Scid}:] (default:\mbox{ }{\tt 1}) -The index number of the stratum column in~$S$ -\item[{\tt O}:] -Location to store the output matrix defined in Table~\ref{table:stratoutput} -\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) -Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; -see read/write functions in SystemML Language Reference for details. -\end{Description} - - -\begin{table}[t]\small\hfil -\begin{tabular}{|rcl|rcl|} -\hline -& Col.\# & Meaning & & Col.\# & Meaning \\ -\hline -\multirow{9}{*}{\begin{sideways}1-st covariate\end{sideways}}\hspace{-1em} -& 01 & $X$-column number & -\multirow{9}{*}{\begin{sideways}2-nd covariate\end{sideways}}\hspace{-1em} -& 11 & $Y$-column number \\ -& 02 & presence count for $x$ & -& 12 & presence count for $y$ \\ -& 03 & global mean $(x)$ & -& 13 & global mean $(y)$ \\ -& 04 & global std.\ dev. $(x)$ & -& 14 & global std.\ dev. $(y)$ \\ -& 05 & stratified std.\ dev. $(x)$ & -& 15 & stratified std.\ dev. 
$(y)$ \\ -& 06 & $R^2$ for $x \sim {}$strata & -& 16 & $R^2$ for $y \sim {}$strata \\ -& 07 & adjusted $R^2$ for $x \sim {}$strata & -& 17 & adjusted $R^2$ for $y \sim {}$strata \\ -& 08 & p-value, $x \sim {}$strata & -& 18 & p-value, $y \sim {}$strata \\ -& 09--10 & reserved & -& 19--20 & reserved \\ -\hline -\multirow{9}{*}{\begin{sideways}$y\sim x$, NO strata\end{sideways}}\hspace{-1.15em} -& 21 & presence count $(x, y)$ & -\multirow{10}{*}{\begin{sideways}$y\sim x$ AND strata$\!\!\!\!$\end{sideways}}\hspace{-1.15em} -& 31 & presence count $(x, y, s)$ \\ -& 22 & regression slope & -& 32 & regression slope \\ -& 23 & regres.\ slope std.\ dev. & -& 33 & regres.\ slope std.\ dev. \\ -& 24 & correlation${} = \pm\sqrt{R^2}$ & -& 34 & correlation${} = \pm\sqrt{R^2}$ \\ -& 25 & residual std.\ dev. & -& 35 & residual std.\ dev. \\ -& 26 & $R^2$ in $y$ due to $x$ & -& 36 & $R^2$ in $y$ due to $x$ \\ -& 27 & adjusted $R^2$ in $y$ due to $x$ & -& 37 & adjusted $R^2$ in $y$ due to $x$ \\ -& 28 & p-value for ``slope = 0'' & -& 38 & p-value for ``slope = 0'' \\ -& 29 & reserved & -& 39 & \# strata with ${\geq}\,2$ count \\ -& 30 & reserved & -& 40 & reserved \\ -\hline -\end{tabular}\hfil -\caption{The {\tt stratstats.dml} output matrix has one row per each distinct -pair of 1-st and 2-nd covariates, and 40 columns with the statistics described -here.} -\label{table:stratoutput} -\end{table} - - - - -\noindent{\bf Details} -\smallskip - -Suppose we have $n$ records of format $(i, x, y)$, where $i\in\{1,\ldots, k\}$ is -a stratum number and $(x, y)$ are two numerical covariates. We want to analyze -conditional linear relationship between $y$ and $x$ conditioned by~$i$. -Note that $x$, but not~$y$, may represent a categorical variable if we assign a -numerical value to each category, for example 0 and 1 for two categories. - -We assume a linear regression model for~$y$: -\begin{equation} -y_{i,j} \,=\, \alpha_i + \beta x_{i,j} + \eps_{i,j}\,, \quad\textrm{where}\,\,\,\, -\eps_{i,j} \sim \Normal(0, \sigma^2) -\label{eqn:stratlinmodel} -\end{equation} -Here $i = 1\ldots k$ is a stratum number and $j = 1\ldots n_i$ is a record number -in stratum~$i$; by $n_i$ we denote the number of records available in stratum~$i$. -The noise term~$\eps_{i,j}$ is assumed to have the same variance in all strata. 
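The quantities derived in the remainder of this section ($V_x$, $V_y$, $V_{x,y}$, the slope
$\hat{\beta}$, the stratified correlation $\hat{R}$, the residual standard deviation
$\hat{\sigma}$, and the $t$-test for ``slope${}={}0$'') can be sketched in a few lines of
NumPy/SciPy. The snippet below is only an illustration of the derivation that follows, not
the {\tt stratstats.dml} code; it assumes complete (NaN-free) inputs and reports a two-sided
$p$-value.

\begin{verbatim}
import numpy as np
from scipy import stats

def stratified_stats(x, y, s):
    # x, y: covariate columns; s: small positive integer stratum IDs
    Vx = Vy = Vxy = 0.0
    strata = np.unique(s)
    for i in strata:
        xi, yi = x[s == i], y[s == i]                 # records of stratum i
        dx, dy = xi - xi.mean(), yi - yi.mean()       # deviations from stratum means
        Vx += (dx * dx).sum()
        Vy += (dy * dy).sum()
        Vxy += (dx * dy).sum()
    n, k = x.size, strata.size
    beta = Vxy / Vx                                   # stratified regression slope
    R = Vxy / np.sqrt(Vx * Vy)                        # stratified correlation
    rss = Vy * (1.0 - R * R)                          # residual sum-of-squares
    sigma = np.sqrt(rss / (n - k - 1))                # residual std. deviation
    t = R * np.sqrt((n - k - 1) / (1.0 - R * R))      # t-statistic for "slope = 0"
    pval = 2.0 * stats.t.sf(abs(t), n - k - 1)        # two-sided p-value (our choice)
    return beta, R, sigma, t, pval
\end{verbatim}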
-When $n_i\,{>}\,0$, we can estimate the means of $x_{i, j}$ and $y_{i, j}$ in -stratum~$i$ as -\begin{equation*} -\bar{x}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,x_{i, j}\Big) / n_i\,;\quad -\bar{y}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,y_{i, j}\Big) / n_i -\end{equation*} -If $\beta$ is known, the best estimate for $\alpha_i$ is $\bar{y}_i - \beta \bar{x}_i$, -which gives the prediction error sum-of-squares of -\begin{equation} -\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \beta x_{i,j} - (\bar{y}_i - \beta \bar{x}_i)\big)^2 -\,\,=\,\, \beta^{2\,}V_x \,-\, 2\beta \,V_{x,y} \,+\, V_y -\label{eqn:stratsumsq} -\end{equation} -where $V_x$, $V_y$, and $V_{x, y}$ are, correspondingly, the ``stratified'' sample -estimates of variance $\Var(x)$ and $\Var(y)$ and covariance $\Cov(x,y)$ computed as -\begin{align*} -V_x \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)^2; \quad -V_y \,=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \bar{y}_i\big)^2;\\ -V_{x,y} \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)\big(y_{i,j} - \bar{y}_i\big) -\end{align*} -They are stratified because we compute the sample (co-)variances in each stratum~$i$ -separately, then combine by summation. The stratified estimates for $\Var(X)$ and $\Var(Y)$ -tend to be smaller than the non-stratified ones (with the global mean instead of $\bar{x}_i$ -and~$\bar{y}_i$) since $\bar{x}_i$ and $\bar{y}_i$ fit closer to $x_{i,j}$ and $y_{i,j}$ -than the global means. The stratified variance estimates the uncertainty in $x_{i,j}$ -and~$y_{i,j}$ given their stratum~$i$. - -Minimizing over~$\beta$ the error sum-of-squares~(\ref{eqn:stratsumsq}) -gives us the regression slope estimate \mbox{$\hat{\beta} = V_{x,y} / V_x$}, -with~(\ref{eqn:stratsumsq}) becoming the residual sum-of-squares~(RSS): -\begin{equation*} -\mathrm{RSS} \,\,=\, \, -\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - -\hat{\beta} x_{i,j} - (\bar{y}_i - \hat{\beta} \bar{x}_i)\big)^2 -\,\,=\,\, V_y \,\big(1 \,-\, V_{x,y}^2 / (V_x V_y)\big) -\end{equation*} -The quantity $\hat{R}^2 = V_{x,y}^2 / (V_x V_y)$, called \emph{$R$-squared}, estimates the fraction -of stratified variance in~$y_{i,j}$ explained by covariate $x_{i, j}$ in the linear -regression model~(\ref{eqn:stratlinmodel}). We define \emph{stratified correlation} as the -square root of~$\hat{R}^2$ taken with the sign of~$V_{x,y}$. We also use RSS to estimate -the residual standard deviation $\sigma$ in~(\ref{eqn:stratlinmodel}) that models the prediction error -of $y_{i,j}$ given $x_{i,j}$ and the stratum: -\begin{equation*} -\hat{\beta}\, =\, \frac{V_{x,y}}{V_x}; \,\,\,\, \hat{R} \,=\, \frac{V_{x,y}}{\sqrt{V_x V_y}}; -\,\,\,\, \hat{R}^2 \,=\, \frac{V_{x,y}^2}{V_x V_y}; -\,\,\,\, \hat{\sigma} \,=\, \sqrt{\frac{\mathrm{RSS}}{n - k - 1}}\,\,\,\, -\Big(n = \sum_{i=1}^k n_i\Big) -\end{equation*} - -The $t$-test and the $F$-test for the null-hypothesis of ``$\beta = 0$'' are -obtained by considering the effect of $\hat{\beta}$ on the residual sum-of-squares, -measured by the decrease from $V_y$ to~RSS. -The $F$-statistic is the ratio of the ``explained'' sum-of-squares -to the residual sum-of-squares, divided by their corresponding degrees of freedom. 
-There are $n\,{-}\,k$ degrees of freedom for~$V_y$, parameter $\beta$ reduces that -to $n\,{-}\,k\,{-}\,1$ for~RSS, and their difference $V_y - {}$RSS has just 1 degree -of freedom: -\begin{equation*} -F \,=\, \frac{(V_y - \mathrm{RSS})/1}{\mathrm{RSS}/(n\,{-}\,k\,{-}\,1)} -\,=\, \frac{\hat{R}^2\,(n\,{-}\,k\,{-}\,1)}{1-\hat{R}^2}; \quad -t \,=\, \hat{R}\, \sqrt{\frac{n\,{-}\,k\,{-}\,1}{1-\hat{R}^2}}. -\end{equation*} -The $t$-statistic is simply the square root of the $F$-statistic with the appropriate -choice of sign. If the null hypothesis and the linear model are both true, the $t$-statistic -has Student $t$-distribution with $n\,{-}\,k\,{-}\,1$ degrees of freedom. We can -also compute it if we divide $\hat{\beta}$ by its estimated standard deviation: -\begin{equation*} -\stdev(\hat{\beta})_{\mathrm{est}} \,=\, \hat{\sigma}\,/\sqrt{V_x} \quad\Longrightarrow\quad -t \,=\, \hat{R}\sqrt{V_y} \,/\, \hat{\sigma} \,=\, \beta \,/\, \stdev(\hat{\beta})_{\mathrm{est}} -\end{equation*} -The standard deviation estimate for~$\beta$ is included in {\tt stratstats.dml} output. - -\smallskip -\noindent{\bf Returns} -\smallskip - -The output matrix format is defined in Table~\ref{table:stratoutput}. - -\smallskip -\noindent{\bf Examples} -\smallskip - -{\hangindent=\parindent\noindent\tt -\hml -f stratstats.dml -nvargs X=/user/biadmin/X.mtx Xcid=/user/biadmin/Xcid.mtx - Y=/user/biadmin/Y.mtx Ycid=/user/biadmin/Ycid.mtx S=/user/biadmin/S.mtx Scid=2 - O=/user/biadmin/Out.mtx fmt=csv - -} -{\hangindent=\parindent\noindent\tt -\hml -f stratstats.dml -nvargs X=/user/biadmin/Data.mtx Xcid=/user/biadmin/Xcid.mtx - Ycid=/user/biadmin/Ycid.mtx Scid=7 O=/user/biadmin/Out.mtx - -} - -%\smallskip -%\noindent{\bf See Also} -%\smallskip -% -%For non-stratified bivariate statistics with a wider variety of input data types -%and statistical tests, see \ldots. For general linear regression, see -%{\tt LinearRegDS.dml} and {\tt LinearRegCG.dml}. For logistic regression, appropriate -%when the response variable is categorical, see {\tt MultiLogReg.dml}. - http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4e9699e4/docs/Algorithms Reference/DescriptiveUnivarStats.tex ---------------------------------------------------------------------- diff --git a/docs/Algorithms Reference/DescriptiveUnivarStats.tex b/docs/Algorithms Reference/DescriptiveUnivarStats.tex deleted file mode 100644 index 5838e3e..0000000 --- a/docs/Algorithms Reference/DescriptiveUnivarStats.tex +++ /dev/null @@ -1,603 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Univariate Statistics} - -\noindent{\bf Description} -\smallskip - -\emph{Univariate statistics} are the simplest form of descriptive statistics in data -analysis. 
They are used to quantitatively describe the main characteristics of each -feature in the data. For a given dataset matrix, script \UnivarScriptName{} computes -certain univariate statistics for each feature column in the -matrix. The feature type governs the exact set of statistics computed for that feature. -For example, the statistic \emph{mean} can only be computed on a quantitative (scale) -feature like `Height' and `Temperature'. It does not make sense to compute the mean -of a categorical attribute like `Hair Color'. - - -\smallskip -\noindent{\bf Usage} -\smallskip - -{\hangindent=\parindent\noindent\it%\tolerance=0 -{\tt{}-f } \UnivarScriptName{} -{\tt{} -nvargs} -{\tt{} X=}path/file -{\tt{} TYPES=}path/file -{\tt{} STATS=}path/file -% {\tt{} fmt=}format - -} - - -\medskip -\pagebreak[2] -\noindent{\bf Arguments} -\begin{Description} -\item[{\tt X}:] -Location (on HDFS) to read the data matrix $X$ whose columns we want to -analyze as the features. -\item[{\tt TYPES}:] % (default:\mbox{ }{\tt " "}) -Location (on HDFS) to read the single-row matrix whose $i^{\textrm{th}}$ -column-cell contains the type of the $i^{\textrm{th}}$ feature column -\texttt{X[,$\,i$]} in the data matrix. Feature types must be encoded by -integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. -% The default value means ``treat all $X$-columns as scale.'' -\item[{\tt STATS}:] -Location (on HDFS) where the output matrix of computed statistics -will be stored. The format of the output matrix is defined by -Table~\ref{table:univars}. -% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) -% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; -% see read/write functions in SystemML Language Reference for details. -\end{Description} - -\begin{table}[t]\hfil -\begin{tabular}{|rl|c|c|} -\hline -\multirow{2}{*}{Row}& \multirow{2}{*}{Name of Statistic} & \multicolumn{2}{c|}{Applies to:} \\ - & & Scale & Categ.\\ -\hline -\OutputRowIDMinimum & Minimum & + & \\ -\OutputRowIDMaximum & Maximum & + & \\ -\OutputRowIDRange & Range & + & \\ -\OutputRowIDMean & Mean & + & \\ -\OutputRowIDVariance & Variance & + & \\ -\OutputRowIDStDeviation & Standard deviation & + & \\ -\OutputRowIDStErrorMean & Standard error of mean & + & \\ -\OutputRowIDCoeffVar & Coefficient of variation & + & \\ -\OutputRowIDSkewness & Skewness & + & \\ -\OutputRowIDKurtosis & Kurtosis & + & \\ -\OutputRowIDStErrorSkewness & Standard error of skewness & + & \\ -\OutputRowIDStErrorCurtosis & Standard error of kurtosis & + & \\ -\OutputRowIDMedian & Median & + & \\ -\OutputRowIDIQMean & Inter quartile mean & + & \\ -\OutputRowIDNumCategories & Number of categories & & + \\ -\OutputRowIDMode & Mode & & + \\ -\OutputRowIDNumModes & Number of modes & & + \\ -\hline -\end{tabular}\hfil -\caption{The output matrix of \UnivarScriptName{} has one row per each -univariate statistic and one column per input feature. This table lists -the meaning of each row. Signs ``+'' show applicability to scale or/and -to categorical features.} -\label{table:univars} -\end{table} - - -\pagebreak[1] - -\smallskip -\noindent{\bf Details} -\smallskip - -Given an input matrix \texttt{X}, this script computes the set of all -relevant univariate statistics for each feature column \texttt{X[,$\,i$]} -in~\texttt{X}. The list of statistics to be computed depends on the -\emph{type}, or \emph{measurement level}, of each column. -The \textrm{TYPES} command-line argument points to a vector containing -the types of all columns. 
The types must be provided as per the following -convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. - -Below we list all univariate statistics computed by script \UnivarScriptName. -The statistics are collected by relevance into several groups, namely: central -tendency, dispersion, shape, and categorical measures. The first three groups -contain statistics computed for a quantitative (also known as: numerical, scale, -or continuous) feature; the last group contains the statistics for a categorical -(either nominal or ordinal) feature. - -Let~$n$ be the number of data records (rows) with feature values. -In what follows we fix a column index \texttt{idx} and consider -sample statistics of feature column \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}. -Let $v = (v_1, v_2, \ldots, v_n)$ be the values of \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]} -in their original unsorted order: $v_i = \texttt{X[}i\texttt{,}\,\texttt{idx]}$. -Let $v^s = (v^s_1, v^s_2, \ldots, v^s_n)$ be the same values in the sorted order, -preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$. - -\paragraph{Central tendency measures.} -Sample statistics that describe the location of the quantitative (scale) feature distribution, -represent it with a single value. -\begin{Description} -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it Mean] -\OutputRowText{\OutputRowIDMean} -The arithmetic average over a sample of a quantitative feature. -Computed as the ratio between the sum of values and the number of values: -$\left(\sum_{i=1}^n v_i\right)\!/n$. -Example: the mean of sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ -equals~5.2. - -Note that the mean is significantly affected by extreme values in the sample -and may be misleading as a central tendency measure if the feature varies on -exponential scale. For example, the mean of $\{$0.01, 0.1, 1.0, 10.0, 100.0$\}$ -is 22.222, greater than all the sample values except the~largest. 
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -\begin{figure}[t] -\setlength{\unitlength}{10pt} -\begin{picture}(33,12) -\put( 6.2, 0.0){\small 2.2} -\put(10.2, 0.0){\small 3.2} -\put(12.2, 0.0){\small 3.7} -\put(15.0, 0.0){\small 4.4} -\put(18.6, 0.0){\small 5.3} -\put(20.2, 0.0){\small 5.7} -\put(21.75,0.0){\small 6.1} -\put(23.05,0.0){\small 6.4} -\put(26.2, 0.0){\small 7.2} -\put(28.6, 0.0){\small 7.8} -\put( 0.5, 0.7){\small 0.0} -\put( 0.1, 3.2){\small 0.25} -\put( 0.5, 5.7){\small 0.5} -\put( 0.1, 8.2){\small 0.75} -\put( 0.5,10.7){\small 1.0} -\linethickness{1.5pt} -\put( 2.0, 1.0){\line(1,0){4.8}} -\put( 6.8, 1.0){\line(0,1){1.0}} -\put( 6.8, 2.0){\line(1,0){4.0}} -\put(10.8, 2.0){\line(0,1){1.0}} -\put(10.8, 3.0){\line(1,0){2.0}} -\put(12.8, 3.0){\line(0,1){1.0}} -\put(12.8, 4.0){\line(1,0){2.8}} -\put(15.6, 4.0){\line(0,1){1.0}} -\put(15.6, 5.0){\line(1,0){3.6}} -\put(19.2, 5.0){\line(0,1){1.0}} -\put(19.2, 6.0){\line(1,0){1.6}} -\put(20.8, 6.0){\line(0,1){1.0}} -\put(20.8, 7.0){\line(1,0){1.6}} -\put(22.4, 7.0){\line(0,1){1.0}} -\put(22.4, 8.0){\line(1,0){1.2}} -\put(23.6, 8.0){\line(0,1){1.0}} -\put(23.6, 9.0){\line(1,0){3.2}} -\put(26.8, 9.0){\line(0,1){1.0}} -\put(26.8,10.0){\line(1,0){2.4}} -\put(29.2,10.0){\line(0,1){1.0}} -\put(29.2,11.0){\line(1,0){4.8}} -\linethickness{0.3pt} -\put( 6.8, 1.0){\circle*{0.3}} -\put(10.8, 1.0){\circle*{0.3}} -\put(12.8, 1.0){\circle*{0.3}} -\put(15.6, 1.0){\circle*{0.3}} -\put(19.2, 1.0){\circle*{0.3}} -\put(20.8, 1.0){\circle*{0.3}} -\put(22.4, 1.0){\circle*{0.3}} -\put(23.6, 1.0){\circle*{0.3}} -\put(26.8, 1.0){\circle*{0.3}} -\put(29.2, 1.0){\circle*{0.3}} -\put( 6.8, 1.0){\vector(1,0){27.2}} -\put( 2.0, 1.0){\vector(0,1){10.8}} -\put( 2.0, 3.5){\line(1,0){10.8}} -\put( 2.0, 6.0){\line(1,0){17.2}} -\put( 2.0, 8.5){\line(1,0){21.6}} -\put( 2.0,11.0){\line(1,0){27.2}} -\put(12.8, 1.0){\line(0,1){2.0}} -\put(19.2, 1.0){\line(0,1){5.0}} -\put(20.0, 1.0){\line(0,1){5.0}} -\put(23.6, 1.0){\line(0,1){7.0}} -\put( 9.0, 4.0){\line(1,0){3.8}} -\put( 9.2, 2.7){\vector(0,1){0.8}} -\put( 9.2, 4.8){\vector(0,-1){0.8}} -\put(19.4, 8.0){\line(1,0){3.0}} -\put(19.6, 7.2){\vector(0,1){0.8}} -\put(19.6, 9.3){\vector(0,-1){0.8}} -\put(13.0, 2.2){\small $q_{25\%}$} -\put(17.3, 2.2){\small $q_{50\%}$} -\put(23.8, 2.2){\small $q_{75\%}$} -\put(20.15,3.5){\small $\mu$} -\put( 8.0, 3.75){\small $\phi_1$} -\put(18.35,7.8){\small $\phi_2$} -\end{picture} -\label{fig:example_quartiles} -\caption{The computation of quartiles, median, and interquartile mean from the -empirical distribution function of the 10-point -sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$. Each vertical step in -the graph has height~$1{/}n = 0.1$. Values $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$ denote -the $1^{\textrm{st}}$, $2^{\textrm{nd}}$, and $3^{\textrm{rd}}$ quartiles correspondingly; -value~$\mu$ denotes the median. Values $\phi_1$ and $\phi_2$ show the partial contribution -of border points (quartiles) $v_3=3.7$ and $v_8=6.4$ into the interquartile mean.} -\end{figure} - -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it Median] -\OutputRowText{\OutputRowIDMedian} -The ``middle'' value that separates the higher half of the sample values -(in a sorted order) from the lower half. -To compute the median, we sort the sample in the increasing order, preserving -duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$. 
-If $n$ is odd, the median equals $v^s_i$ where $i = (n\,{+}\,1)\,{/}\,2$, -same as the $50^{\textrm{th}}$~percentile of the sample. -If $n$ is even, there are two ``middle'' values $v^s_{n/2}$ and $v^s_{n/2\,+\,1}$, -so we compute the median as the mean of these two values. -(For even~$n$ we compute the $50^{\textrm{th}}$~percentile as~$v^s_{n/2}$, -not as the median.) Example: the median of sample -$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ -equals $(5.3\,{+}\,5.7)\,{/}\,2$~${=}$~5.5, see Figure~\ref{fig:example_quartiles}. - -Unlike the mean, the median is not sensitive to extreme values in the sample, -i.e.\ it is robust to outliers. It works better as a measure of central tendency -for heavy-tailed distributions and features that vary on exponential scale. -However, the median is sensitive to small sample size. -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it Interquartile mean] -\OutputRowText{\OutputRowIDIQMean} -For a sample of a quantitative feature, this is -the mean of the values greater than or equal to the $1^{\textrm{st}}$ quartile -and less than or equal the $3^{\textrm{rd}}$ quartile. -In other words, it is a ``truncated mean'' where the lowest 25$\%$ and -the highest 25$\%$ of the sorted values are omitted in its computation. -The two ``border values'', i.e.\ the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$ -quartiles themselves, contribute to this mean only partially. -This measure is occasionally used as the ``robust'' version of the mean -that is less sensitive to the extreme values. - -To compute the measure, we sort the sample in the increasing order, -preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$. -We set $j = \lceil n{/}4 \rceil$ for the $1^{\textrm{st}}$ quartile index -and $k = \lceil 3n{/}4 \rceil$ for the $3^{\textrm{rd}}$ quartile index, -then compute the following weighted mean: -\begin{equation*} -\frac{1}{3{/}4 - 1{/}4} \left[ -\left(\frac{j}{n} - \frac{1}{4}\right) v^s_j \,\,+ -\sum_{j<i<k} \left(\frac{i}{n} - \frac{i\,{-}\,1}{n}\right) v^s_i -\,\,+\,\, \left(\frac{3}{4} - \frac{k\,{-}\,1}{n}\right) v^s_k\right] -\end{equation*} -In other words, all sample values between the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$ -quartile enter the sum with weights $2{/}n$, times their number of duplicates, while the -two quartiles themselves enter the sum with reduced weights. The weights are proportional -to the vertical steps in the empirical distribution function of the sample, see -Figure~\ref{fig:example_quartiles} for an illustration. -Example: the interquartile mean of sample -$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals the sum -$0.1 (3.7\,{+}\,6.4) + 0.2 (4.4\,{+}\,5.3\,{+}\,5.7\,{+}\,6.1)$, -which equals~5.31. -\end{Description} - - -\paragraph{Dispersion measures.} -Statistics that describe the amount of variation or spread in a quantitative -(scale) data feature. -\begin{Description} -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it Variance] -\OutputRowText{\OutputRowIDVariance} -A measure of dispersion, or spread-out, of sample values around their mean, -expressed in units that are the square of those of the feature itself. -Computed as the sum of squared differences between the values -in the sample and their mean, divided by one less than the number of -values: $\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$ where -$\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$. 
-\end{Description}
-
-
-\paragraph{Dispersion measures.}
-Statistics that describe the amount of variation or spread in a quantitative
-(scale) data feature.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Variance]
-\OutputRowText{\OutputRowIDVariance}
-A measure of dispersion, or spread, of sample values around their mean,
-expressed in units that are the square of those of the feature itself.
-Computed as the sum of squared differences between the values
-in the sample and their mean, divided by one less than the number of
-values: $\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$ where
-$\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
-Example: the variance of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~3.24.
-Note that at least two values ($n\geq 2$) are required to avoid division
-by zero. Sample variance is sensitive to outliers, even more so than the mean.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Standard deviation]
-\OutputRowText{\OutputRowIDStDeviation}
-A measure of dispersion around the mean, equal to the square root of the variance.
-Computed by taking the square root of the sample variance;
-see \emph{Variance} above for how the variance is computed.
-Example: the standard deviation of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~1.8.
-At least two values are required to avoid division by zero.
-Note that the standard deviation is sensitive to outliers.
-
-Standard deviation is used in conjunction with the mean to determine
-an interval containing a given percentage of the feature values,
-assuming a normal distribution. In a large sample from a normal
-distribution, around 68\% of the cases fall within one standard
-deviation and around 95\% of the cases fall within two standard deviations
-of the mean. For example, if the mean age is 45 with a standard deviation
-of 10, around 95\% of the cases would be between 25 and 65 in a normal
-distribution.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Coefficient of variation]
-\OutputRowText{\OutputRowIDCoeffVar}
-The ratio of the standard deviation to the mean, i.e.\ the
-\emph{relative} standard deviation, of a quantitative feature sample.
-Computed by dividing the sample \emph{standard deviation} by the
-sample \emph{mean}; see above for their computation details.
-Example: the coefficient of variation for sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals 1.8$\,{/}\,$5.2~${\approx}$~0.346.
-
-This metric is used primarily with non-negative features such as
-financial or population data. It is sensitive to outliers.
-Note: a zero mean causes division by zero, returning infinity or \texttt{NaN}.
-At least two values (records) are required to compute the standard deviation.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Minimum]
-\OutputRowText{\OutputRowIDMinimum}
-The smallest value of a quantitative sample, computed as $\min v = v^s_1$.
-Example: the minimum of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals~2.2.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Maximum]
-\OutputRowText{\OutputRowIDMaximum}
-The largest value of a quantitative sample, computed as $\max v = v^s_n$.
-Example: the maximum of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals~7.8.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Range]
-\OutputRowText{\OutputRowIDRange}
-The difference between the largest and the smallest value of a quantitative
-sample, computed as $\max v - \min v = v^s_n - v^s_1$.
-It provides information about the overall spread of the sample values.
-Example: the range of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals 7.8$\,{-}\,$2.2~${=}$~5.6.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Standard error of the mean]
-\OutputRowText{\OutputRowIDStErrorMean}
-A measure of how much the value of the sample mean may vary from sample
-to sample taken from the same (hypothesized) distribution of the feature.
-It helps to roughly bound the distribution mean, i.e.\
-the limit of the sample mean as the sample size tends to infinity.
-Under certain assumptions (e.g.\ normality and a large sample), the difference
-between the distribution mean and the sample mean is unlikely to exceed
-2~standard errors.
-
-The measure is computed by dividing the sample standard deviation
-by the square root of the number of values~$n$; see \emph{standard deviation}
-for its computation details. Ensure $n\,{\geq}\,2$ to avoid division by~0.
-Example: for sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-with the mean of~5.2, the standard error of the mean
-equals 1.8$\,{/}\sqrt{10}$~${\approx}$~0.569.
-
-Note that the standard error itself is subject to sample randomness.
-Its accuracy as an error estimator may be low if the sample size is small,
-if the sample is not \mbox{i.i.d.}, if there are outliers, or if the
-distribution has heavy tails.
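-
-For illustration only (this sketch is not part of \UnivarScriptName{}), the
-dispersion measures above can be reproduced in a few lines of Python; for the
-example sample the expected values are 3.24, 1.8, approximately 0.346, and
-approximately 0.569:
-\begin{verbatim}
-# Illustrative sketch: dispersion measures as defined above
-# (the sample variance uses n-1 in the denominator).
-from math import sqrt
-
-sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
-n = len(sample)                          # n >= 2 is assumed throughout
-mean = sum(sample) / n                   # 5.2
-
-variance = sum((v - mean) ** 2 for v in sample) / (n - 1)  # 3.24
-std_dev = sqrt(variance)                                   # 1.8
-coeff_of_variation = std_dev / mean      # ~0.346 (mean must be nonzero)
-std_error_of_mean = std_dev / sqrt(n)    # ~0.569
-
-print(variance, std_dev, coeff_of_variation, std_error_of_mean)
-print(min(sample), max(sample), max(sample) - min(sample))  # min, max, range
-\end{verbatim}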
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-% \item[\it Quartiles]
-% \OutputRowText{\OutputRowIDQuartiles}
-% %%% dsDefn %%%%
-% The values of a quantitative feature
-% that divide an ordered/sorted set of data records into four equal-size groups.
-% The $1^{\textrm{st}}$ quartile, or the $25^{\textrm{th}}$ percentile, splits
-% the sorted data into the lowest $25\%$ and the highest~$75\%$. In other words,
-% it is the middle value between the minimum and the median. The $2^{\textrm{nd}}$
-% quartile is the median itself, the value that separates the higher half of
-% the data (in the sorted order) from the lower half. Finally, the $3^{\textrm{rd}}$
-% quartile, or the $75^{\textrm{th}}$ percentile, divides the sorted data into
-% lowest $75\%$ and highest~$25\%$.\par
-% %%% dsComp %%%%
-% To compute the quartiles for a data column \texttt{X[,i]} with $n$ numerical values
-% we sort it in the increasing order, preserving duplicates, then return
-% \texttt{X}${}^{\textrm{sort}}$\texttt{[}$k$\texttt{,i]}
-% where $k = \lceil pn \rceil$ for $p = 0.25$, $0.5$, and~$0.75$.
-% When $n$ is even, the $2^{\textrm{nd}}$ quartile (the median) is further adjusted
-% to equal the mean of two middle values
-% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\texttt{,i]}$ and
-% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\,{+}\,1\texttt{,i]}$.
-% %%% dsWarn %%%%
-% We assume that the feature column does not contain \texttt{NaN}s or coded non-numeric values.
-% %%% dsExmpl %%%
-% \textbf{Example(s).}
-\end{Description}
-
-
-\paragraph{Shape measures.}
-Statistics that describe the shape and symmetry of the quantitative (scale)
-feature distribution estimated from a sample of its values.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Skewness]
-\OutputRowText{\OutputRowIDSkewness}
-It measures how symmetrically the values of a feature are spread out
-around the mean. A significant positive skewness implies a longer (or fatter)
-right tail, i.e.\ feature values tend to lie farther away from the mean on the
-right side. A significant negative skewness implies a longer (or fatter) left
-tail. The normal distribution is symmetric and has a skewness value of~0;
-however, its sample skewness is likely to be nonzero, just close to zero.
-As a guideline, a skewness value more than twice its standard error is taken
-to indicate a departure from symmetry.
-
-Skewness is computed as the $3^{\textrm{rd}}$~central moment divided by the cube
-of the standard deviation. We estimate the $3^{\textrm{rd}}$~central moment as
-the sum of cubed differences between the values in the feature column and their
-sample mean, divided by the number of values:
-$\sum_{i=1}^n (v_i - \bar{v})^3 / n$
-where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
-The standard deviation is computed
-as described above in \emph{standard deviation}. To avoid division by~0,
-at least two different sample values are required. Example: for sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-with the mean of~5.2 and the standard deviation of~1.8,
-the skewness is estimated as $-1.0728\,{/}\,1.8^3 \approx -0.184$.
-Note: skewness is sensitive to outliers.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Standard error in skewness]
-\OutputRowText{\OutputRowIDStErrorSkewness}
-A measure of how much the sample skewness may vary from sample to sample,
-assuming that the feature is normally distributed, which makes its
-distribution skewness equal~0.
-Given the number~$n$ of sample values, the standard error is computed as
-\begin{equation*}
-\sqrt{\frac{6n\,(n-1)}{(n-2)(n+1)(n+3)}}
-\end{equation*}
-This measure can tell us, for example:
-\begin{Itemize}
-\item If the sample skewness lands within two standard errors of~0, its
-positive or negative sign is not significant and may be accidental.
-\item If the sample skewness lands outside this interval, the feature
-is unlikely to be normally distributed.
-\end{Itemize}
-At least 3~values ($n\geq 3$) are required to avoid arithmetic failure.
-Note that the standard error is inaccurate if the feature distribution is
-far from normal or if the sample size is small.
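-
-To make the two formulas above concrete, here is a purely illustrative Python
-sketch (not part of \UnivarScriptName{}) that reproduces the sample skewness
-and its standard error for the example sample:
-\begin{verbatim}
-# Illustrative sketch: sample skewness as defined above (3rd central
-# moment over n, divided by the cube of the sample standard deviation
-# that uses n-1), plus its standard error under the normality assumption.
-from math import sqrt
-
-sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
-n = len(sample)                                # n >= 3 for the standard error
-mean = sum(sample) / n
-std_dev = sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
-
-m3 = sum((v - mean) ** 3 for v in sample) / n  # 3rd central moment: -1.0728
-skewness = m3 / std_dev ** 3                   # approximately -0.184
-
-se_skewness = sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
-
-print(skewness, se_skewness)
-\end{verbatim}
-For this sample the skewness lies well within two standard errors of zero, so,
-by the guideline above, its negative sign is not significant.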
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Kurtosis]
-\OutputRowText{\OutputRowIDKurtosis}
-As a distribution parameter, kurtosis is a measure of the extent to which
-feature values cluster around a central point. In other words, it quantifies
-the ``peakedness'' of the distribution: how tall and sharp the central peak is
-relative to a standard bell curve.
-
-Positive kurtosis (\emph{leptokurtic} distribution) indicates that, relative
-to a normal distribution:
-\begin{Itemize}
-\item observations cluster more about the center (peak-shaped),
-\item the tails are thinner at non-extreme values,
-\item the tails are thicker at extreme values.
-\end{Itemize}
-Negative kurtosis (\emph{platykurtic} distribution) indicates that, relative
-to a normal distribution:
-\begin{Itemize}
-\item observations cluster less about the center (box-shaped),
-\item the tails are thicker at non-extreme values,
-\item the tails are thinner at extreme values.
-\end{Itemize}
-The kurtosis of a normal distribution is zero; however, the sample kurtosis
-(computed here) is likely to deviate from zero.
-
-Sample kurtosis is computed as the $4^{\textrm{th}}$~central moment divided
-by the $4^{\textrm{th}}$~power of the standard deviation, minus~3
-(i.e.\ the \emph{excess} kurtosis).
-We estimate the $4^{\textrm{th}}$~central moment as the sum of the
-$4^{\textrm{th}}$~powers of differences between the values in the feature column
-and their sample mean, divided by the number of values:
-$\sum_{i=1}^n (v_i - \bar{v})^4 / n$
-where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
-The standard deviation is computed as described above; see \emph{standard deviation}.
-
-Note that kurtosis is sensitive to outliers, and requires at least two different
-sample values. Example: for sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-with the mean of~5.2 and the standard deviation of~1.8,
-the sample kurtosis equals $16.6962\,{/}\,1.8^4 - 3 \approx -1.41$.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Standard error in kurtosis]
-\OutputRowText{\OutputRowIDStErrorCurtosis}
-A measure of how much the sample kurtosis may vary from sample to sample,
-assuming that the feature is normally distributed, which makes its
-distribution kurtosis equal~0.
-Given the number~$n$ of sample values, the standard error is computed as
-\begin{equation*}
-\sqrt{\frac{24n\,(n-1)^2}{(n-3)(n-2)(n+3)(n+5)}}
-\end{equation*}
-This measure can tell us, for example:
-\begin{Itemize}
-\item If the sample kurtosis lands within two standard errors of~0, its
-positive or negative sign is not significant and may be accidental.
-\item If the sample kurtosis lands outside this interval, the feature
-is unlikely to be normally distributed.
-\end{Itemize}
-At least 4~values ($n\geq 4$) are required to avoid arithmetic failure.
-Note that the standard error is inaccurate if the feature distribution is
-far from normal or if the sample size is small.
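-
-Analogously to the skewness sketch above, the sample kurtosis and its standard
-error can be illustrated with a few lines of Python (again, not part of
-\UnivarScriptName{}):
-\begin{verbatim}
-# Illustrative sketch: sample (excess) kurtosis as defined above (4th
-# central moment over n, divided by the 4th power of the sample standard
-# deviation, minus 3), plus its standard error under normality.
-from math import sqrt
-
-sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
-n = len(sample)                                # n >= 4 for the standard error
-mean = sum(sample) / n
-std_dev = sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
-
-m4 = sum((v - mean) ** 4 for v in sample) / n  # 4th central moment: 16.6962
-kurtosis = m4 / std_dev ** 4 - 3               # approximately -1.41
-
-se_kurtosis = sqrt(24.0 * n * (n - 1) ** 2
-                   / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
-
-print(kurtosis, se_kurtosis)
-\end{verbatim}
-Here, too, the sample kurtosis lies within two standard errors of zero.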
-\end{Description}
-
-
-\paragraph{Categorical measures.} Statistics that describe the sample of
-a categorical feature, either nominal or ordinal. We represent all
-categories by integers from~1 to the number of categories; we call
-these integers \emph{category~IDs}.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Number of categories]
-\OutputRowText{\OutputRowIDNumCategories}
-The maximum category~ID that occurs in the sample. Note that some
-categories with~IDs \emph{smaller} than this maximum~ID may have
-no~occurrences in the sample, without reducing the number of categories.
-However, categories with~IDs \emph{larger} than the maximum~ID that have
-no occurrences in the sample are not counted.
-Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
-the number of categories is reported as~8. Category~IDs 2 and~6, which have
-zero occurrences, are still counted; but if there is a category with
-ID${}=9$ and zero occurrences, it is not counted.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Mode]
-\OutputRowText{\OutputRowIDMode}
-The most frequently occurring category value.
-If several values share the greatest frequency of occurrence, then each
-of them is a mode; but here we report only the smallest of these modes.
-Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
-the modes are 3 and~7, with 3 reported.
-
-Computed by counting the number of occurrences for each category,
-then taking the smallest category~ID that has the maximum count.
-Note that the sample modes may be different from the distribution modes,
-i.e.\ the categories whose (hypothesized) underlying probability is the
-maximum over all categories.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Number of modes]
-\OutputRowText{\OutputRowIDNumModes}
-The number of category values that share the largest frequency
-count in the sample.
-Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
-there are two category~IDs (3 and~7) that each occur the maximum number
-of 4~times; hence, we return~2.
-
-Computed by counting the number of occurrences for each category,
-then counting how many categories have the maximum count.
-
-Note that the sample modes may be different from the distribution modes,
-i.e.\ the categories whose (hypothesized) underlying probability is the
-maximum over all categories.
-A short illustrative sketch of these three categorical measures is given
-after the invocation example below.
-\end{Description}
-
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-The output matrix containing all computed statistics has $17$~rows and
-as many columns as the input matrix~\texttt{X}. Each row corresponds to
-a particular statistic, according to the convention specified in
-Table~\ref{table:univars}. The first $14$~statistics are applicable to
-\emph{scale} columns, and the last $3$~statistics are applicable to categorical,
-i.e.\ nominal and ordinal, columns.
-
-
-\pagebreak[2]
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-\hml -f \UnivarScriptName{} -nvargs X=/user/biadmin/X.mtx
- TYPES=/user/biadmin/types.mtx
- STATS=/user/biadmin/stats.mtx
-
-}
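-
-\smallskip
-As a final illustration (not part of \UnivarScriptName{}), the three
-categorical measures described above can be reproduced for the example
-sample of category~IDs as follows; the expected results are 8, 3, and~2:
-\begin{verbatim}
-# Illustrative sketch: number of categories, mode, and number of modes
-# for a sample of integer category IDs, as defined above.
-from collections import Counter
-
-sample = [1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8]
-
-num_categories = max(sample)      # 8: the largest category ID that occurs
-counts = Counter(sample)          # occurrences per category ID
-max_count = max(counts.values())  # 4
-modes = [c for c, cnt in counts.items() if cnt == max_count]
-mode = min(modes)                 # 3: the smallest of the modes 3 and 7
-num_modes = len(modes)            # 2
-
-print(num_categories, mode, num_modes)
-\end{verbatim}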