http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4e9699e4/docs/Algorithms Reference/DescriptiveBivarStats.tex ---------------------------------------------------------------------- diff --git a/docs/Algorithms Reference/DescriptiveBivarStats.tex b/docs/Algorithms Reference/DescriptiveBivarStats.tex deleted file mode 100644 index a2d3db1..0000000 --- a/docs/Algorithms Reference/DescriptiveBivarStats.tex +++ /dev/null @@ -1,438 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Bivariate Statistics} - -\noindent{\bf Description} -\smallskip - -Bivariate statistics are used to quantitatively describe the association between -two features, such as test their statistical (in-)dependence or measure -the accuracy of one data feature predicting the other feature, in a sample. -The \BivarScriptName{} script computes common bivariate statistics, -such as \NameStatR{} and \NameStatChi{}, in parallel for many pairs -of data features. For a given dataset matrix, script \BivarScriptName{} computes -certain bivariate statistics for the given feature (column) pairs in the -matrix. The feature types govern the exact set of statistics computed for that pair. -For example, \NameStatR{} can only be computed on two quantitative (scale) -features like `Height' and `Temperature'. -It does not make sense to compute the linear correlation of two categorical attributes -like `Hair Color'. - - -\smallskip -\noindent{\bf Usage} -\smallskip - -{\hangindent=\parindent\noindent\it%\tolerance=0 -{\tt{}-f }path/\/\BivarScriptName{} -{\tt{} -nvargs} -{\tt{} X=}path/file -{\tt{} index1=}path/file -{\tt{} index2=}path/file -{\tt{} types1=}path/file -{\tt{} types2=}path/file -{\tt{} OUTDIR=}path -% {\tt{} fmt=}format - -} - - -\smallskip -\noindent{\bf Arguments} -\begin{Description} -\item[{\tt X}:] -Location (on HDFS) to read the data matrix $X$ whose columns are the features -that we want to compare and correlate with bivariate statistics. -\item[{\tt index1}:] % (default:\mbox{ }{\tt " "}) -Location (on HDFS) to read the single-row matrix that lists the column indices -of the \emph{first-argument} features in pairwise statistics. -Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the -index $k$ of column \texttt{X[,$\,k$]} in the data matrix -whose bivariate statistics need to be computed. -% The default value means ``use all $X$-columns from the first to the last.'' -\item[{\tt index2}:] % (default:\mbox{ }{\tt " "}) -Location (on HDFS) to read the single-row matrix that lists the column indices -of the \emph{second-argument} features in pairwise statistics. 
-Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the
-index $l$ of column \texttt{X[,$\,l$]} in the data matrix
-whose bivariate statistics need to be computed.
-% The default value means ``use all $X$-columns from the first to the last.''
-\item[{\tt types1}:] % (default:\mbox{ }{\tt " "})
-Location (on HDFS) to read the single-row matrix that lists the \emph{types}
-of the \emph{first-argument} features in pairwise statistics.
-Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the type
-of column \texttt{X[,$\,k$]} in the data matrix, where $k$ is the $i^{\textrm{th}}$
-entry in the {\tt index1} matrix. Feature types must be encoded by
-integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
-% The default value means ``treat all referenced $X$-columns as scale.''
-\item[{\tt types2}:] % (default:\mbox{ }{\tt " "})
-Location (on HDFS) to read the single-row matrix that lists the \emph{types}
-of the \emph{second-argument} features in pairwise statistics.
-Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the type
-of column \texttt{X[,$\,l$]} in the data matrix, where $l$ is the $j^{\textrm{th}}$
-entry in the {\tt index2} matrix. Feature types must be encoded by
-integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
-% The default value means ``treat all referenced $X$-columns as scale.''
-\item[{\tt OUTDIR}:]
-Location path (on HDFS) where the output matrices with computed bivariate
-statistics will be stored. The matrices' file names and format are defined
-in Table~\ref{table:bivars}.
-% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-% see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-\begin{table}[t]\hfil
-\begin{tabular}{|lll|}
-\hline\rule{0pt}{12pt}%
-Output File / Matrix & Row$\,$\# & Name of Statistic \\[2pt]
-\hline\hline\rule{0pt}{12pt}%
-\emph{All Files} & 1 & 1-st feature column \\
-\rule{1em}{0pt}" & 2 & 2-nd feature column \\[2pt]
-\hline\rule{0pt}{12pt}%
-bivar.scale.scale.stats & 3 & \NameStatR \\[2pt]
-\hline\rule{0pt}{12pt}%
-bivar.nominal.nominal.stats & 3 & \NameStatChi \\
-\rule{1em}{0pt}" & 4 & Degrees of freedom \\
-\rule{1em}{0pt}" & 5 & \NameStatPChi \\
-\rule{1em}{0pt}" & 6 & \NameStatV \\[2pt]
-\hline\rule{0pt}{12pt}%
-bivar.nominal.scale.stats & 3 & \NameStatEta \\
-\rule{1em}{0pt}" & 4 & \NameStatF \\[2pt]
-\hline\rule{0pt}{12pt}%
-bivar.ordinal.ordinal.stats & 3 & \NameStatRho \\[2pt]
-\hline
-\end{tabular}\hfil
-\caption{%
-The output matrices of \BivarScriptName{} have one row per bivariate
-statistic and one column per pair of input features. This table lists
-the meaning of each matrix and each row.%
-% Signs ``+'' show applicability to scale or/and to categorical features.
-}
-\label{table:bivars}
-\end{table}
-
-
-
-\pagebreak[2]
-
-\noindent{\bf Details}
-\smallskip
-
-Script \BivarScriptName{} takes an input matrix \texttt{X} whose columns represent
-the features and whose rows represent the records of a data sample.
-Given \texttt{X}, the script computes certain relevant bivariate statistics
-for specified pairs of feature columns \texttt{X[,$\,i$]} and \texttt{X[,$\,j$]}.
-Command-line parameters \texttt{index1} and \texttt{index2} specify the files with
-column pairs of interest to the user.
Namely, the file given by \texttt{index1}
-contains the vector of the 1st-attribute column indices and the file given
-by \texttt{index2} has the vector of the 2nd-attribute column indices, with
-``1st'' and ``2nd'' referring to their places in bivariate statistics.
-Note that both \texttt{index1} and \texttt{index2} files should contain a 1-row matrix
-of positive integers.
-
-The bivariate statistics to be computed depend on the \emph{types}, or
-\emph{measurement levels}, of the two columns.
-The types for each pair are provided in the files whose locations are specified by
-\texttt{types1} and \texttt{types2} command-line parameters.
-These files are also 1-row matrices, i.e.\ vectors, that list the 1st-attribute and
-the 2nd-attribute column types in the same order as their indices in the
-\texttt{index1} and \texttt{index2} files. The types must be provided as per
-the following convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
-
-The script organizes its results into (potentially) four output matrices, one for
-each type combination. The types of bivariate statistics are defined using the types
-of the columns that were used for their arguments, with ``ordinal'' sometimes
-retrogressing to ``nominal.'' Table~\ref{table:bivars} describes what each row
-in each output matrix contains. In particular, the script includes the following
-statistics:
-\begin{Itemize}
-\item For a pair of scale (quantitative) columns, \NameStatR;
-\item For a pair of nominal columns (with finite-sized, fixed, unordered domains),
-the \NameStatChi{} and its p-value;
-\item For a pair of one scale column and one nominal column, the \NameStatEta{} and the \NameStatF{};
-\item For a pair of ordinal columns (ordered domains depicting ranks), \NameStatRho.
-\end{Itemize}
-Note that, as shown in Table~\ref{table:bivars}, the output matrices contain the
-column indices of the features involved in each statistic.
-Moreover, if the output matrix does not contain
-a value in a certain cell then it should be interpreted as a~$0$
-(sparse matrix representation).
-
-Below we list all bivariate statistics computed by script \BivarScriptName.
-The statistics are collected into several groups by the type of their input
-features. We refer to the two input features as $v_1$ and $v_2$ unless
-specified otherwise; the value pairs are $(v_{1,i}, v_{2,i})$ for $i=1,\ldots,n$,
-where $n$ is the number of rows in \texttt{X}, i.e.\ the sample size.
-
-
-\paragraph{Scale-vs-scale statistics.}
-Sample statistics that describe association between two quantitative (scale) features.
-A scale feature has numerical values, with the natural ordering relation.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it\NameStatR]:
-A measure of linear dependence between two numerical features:
-\begin{equation*}
-r \,\,=\,\, \frac{\Cov(v_1, v_2)}{\sqrt{\Var v_1 \Var v_2}}
-\,\,=\,\, \frac{\sum_{i=1}^n (v_{1,i} - \bar{v}_1) (v_{2,i} - \bar{v}_2)}%
-{\sqrt{\sum_{i=1}^n (v_{1,i} - \bar{v}_1)^{2\mathstrut} \cdot \sum_{i=1}^n (v_{2,i} - \bar{v}_2)^{2\mathstrut}}}
-\end{equation*}
-Commonly denoted by~$r$, correlation ranges between $-1$ and $+1$, reaching ${\pm}1$ when all value
-pairs $(v_{1,i}, v_{2,i})$ lie on the same line. Correlation near~0 means that a line is not a good
-way to represent the dependence between the two features; however, this does not imply independence.
-The sign indicates direction of the linear association: $r > 0$ ($r < 0$) if one feature tends to -linearly increase (decrease) when the other feature increases. Nonlinear association, if present, -may disobey this sign. -\NameStatR{} is symmetric: $r(v_1, v_2) = r(v_2, v_1)$; it does not change if we transform $v_1$ and $v_2$ -to $a + b v_1$ and $c + d v_2$ where $a, b, c, d$ are constants and $b, d > 0$. - -Suppose that we use simple linear regression to represent one feature given the other, say -represent $v_{2,i} \approx \alpha + \beta v_{1,i}$ by selecting $\alpha$ and $\beta$ -to minimize the least-squares error $\sum_{i=1}^n (v_{2,i} - \alpha - \beta v_{1,i})^2$. -Then the best error equals -\begin{equation*} -\min_{\alpha, \beta} \,\,\sum_{i=1}^n \big(v_{2,i} - \alpha - \beta v_{1,i}\big)^2 \,\,=\,\, -(1 - r^2) \,\sum_{i=1}^n \big(v_{2,i} - \bar{v}_2\big)^2 -\end{equation*} -In other words, $1\,{-}\,r^2$ is the ratio of the residual sum of squares to -the total sum of squares. Hence, $r^2$ is an accuracy measure of the linear regression. -\end{Description} - - -\paragraph{Nominal-vs-nominal statistics.} -Sample statistics that describe association between two nominal categorical features. -Both features' value domains are encoded with positive integers in arbitrary order: -nominal features do not order their value domains. -\begin{Description} -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it\NameStatChi]: -A measure of how much the frequencies of value pairs of two categorical features deviate from -statistical independence. Under independence, the probability of every value pair must equal -the product of probabilities of each value in the pair: -$\Prob[a, b] - \Prob[a]\,\Prob[b] = 0$. But we do not know these (hypothesized) probabilities; -we only know the sample frequency counts. Let $n_{a,b}$ be the frequency count of pair -$(a, b)$, let $n_a$ and $n_b$ be the frequency counts of $a$~alone and of $b$~alone. Under -independence, difference $n_{a,b}{/}n - (n_a{/}n)(n_b{/}n)$ is unlikely to be exactly~0 due -to sample randomness, yet it is unlikely to be too far from~0. For some pairs $(a,b)$ it may -deviate from~0 farther than for other pairs. \NameStatChi{}~is an aggregate measure that -combines squares of these differences across all value pairs: -\begin{equation*} -\chi^2 \,\,=\,\, \sum_{a,\,b} \Big(\frac{n_a n_b}{n}\Big)^{-1} \Big(n_{a,b} - \frac{n_a n_b}{n}\Big)^2 -\,=\,\, \sum_{a,\,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}} -\end{equation*} -where $O_{a,b} = n_{a,b}$ are the \emph{observed} frequencies and $E_{a,b} = (n_a n_b){/}n$ are -the \emph{expected} frequencies for all pairs~$(a,b)$. Under independence (plus other standard -assumptions) the sample~$\chi^2$ closely follows a well-known distribution, making it a basis for -statistical tests for independence, see~\emph{\NameStatPChi} for details. Note that \NameStatChi{} -does \emph{not} measure the strength of dependence: even very weak dependence may result in a -significant deviation from independence if the counts are large enough. Use~\NameStatV{} instead -to measure the strength of dependence. -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it Degrees of freedom]: -An integer parameter required for the interpretation of~\NameStatChi{} measure. 
Under independence -(plus other standard assumptions) the sample~$\chi^2$ statistic is approximately distributed as the -sum of $d$~squares of independent normal random variables with mean~0 and variance~1, where $d$ is -this integer parameter. For a pair of categorical features such that the $1^{\textrm{st}}$~feature -has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories, the number of degrees -of freedom is $d = (k_1 - 1)(k_2 - 1)$. -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it\NameStatPChi]: -A measure of how likely we would observe the current frequencies of value pairs of two categorical -features assuming their statistical independence. More precisely, it computes the probability that -the sum of $d$~squares of independent normal random variables with mean~0 and variance~1 -(called the $\chi^2$~distribution with $d$ degrees of freedom) generates a value at least as large -as the current sample \NameStatChi. The $d$ parameter is \emph{degrees of freedom}, see above. -Under independence (plus other standard assumptions) the sample \NameStatChi{} closely follows the -$\chi^2$~distribution and is unlikely to land very far into its tail. On the other hand, if the -two features are dependent, their sample \NameStatChi{} becomes arbitrarily large as $n\to\infty$ -and lands extremely far into the tail of the $\chi^2$~distribution given a large enough data sample. -\NameStatPChi{} returns the tail ``weight'' on the right-hand side of \NameStatChi: -\begin{equation*} -P\,\,=\,\, \Prob\big[r \geq \textrm{\NameStatChi} \,\,\big|\,\, r \sim \textrm{the $\chi^2$ distribution}\big] -\end{equation*} -As any probability, $P$ ranges between 0 and~1. If $P\leq 0.05$, the dependence between the two -features may be considered statistically significant (i.e.\ their independence is considered -statistically ruled out). For highly dependent features, it is not unusual to have $P\leq 10^{-20}$ -or less, in which case our script will simply return $P = 0$. Independent features should have -their $P\geq 0.05$ in about 95\% cases. -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it\NameStatV]: -A measure for the strength of association, i.e.\ of statistical dependence, between two categorical -features, conceptually similar to \NameStatR. It divides the observed~\NameStatChi{} by the maximum -possible~$\chi^2_{\textrm{max}}$ given $n$ and the number $k_1, k_2$~of categories in each feature, -then takes the square root. Thus, \NameStatV{} ranges from 0 to~1, -where 0 implies no association and 1 implies the maximum possible association (one-to-one -correspondence) between the two features. See \emph{\NameStatChi} for the computation of~$\chi^2$; -its maximum${} = {}$% -$n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}$ where the $1^{\textrm{st}}$~feature -has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories~\cite{AcockStavig1979:CramersV}, -so -\begin{equation*} -\textrm{\NameStatV} \,\,=\,\, \sqrt{\frac{\textrm{\NameStatChi}}{n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}}} -\end{equation*} -As opposed to \NameStatPChi, which goes to~0 (rapidly) as the features' dependence increases, -\NameStatV{} goes towards~1 (slowly) as the dependence increases. Both \NameStatChi{} and -\NameStatPChi{} are very sensitive to~$n$, but in \NameStatV{} this is mitigated by taking the -ratio. 
-\end{Description} - - -\paragraph{Nominal-vs-scale statistics.} -Sample statistics that describe association between a categorical feature -(order ignored) and a quantitative (scale) feature. -The values of the categorical feature must be coded as positive integers. -\begin{Description} -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it\NameStatEta]: -A measure for the strength of association (statistical dependence) between a nominal feature -and a scale feature, conceptually similar to \NameStatR. Ranges from 0 to~1, approaching 0 -when there is no association and approaching 1 when there is a strong association. -The nominal feature, treated as the independent variable, is assumed to have relatively few -possible values, all with large frequency counts. The scale feature is treated as the dependent -variable. Denoting the nominal feature by~$x$ and the scale feature by~$y$, we have: -\begin{equation*} -\eta^2 \,=\, 1 - \frac{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, -\,\,\,\,\textrm{where}\,\,\,\, -\hat{y}[x] = \frac{1}{\mathop{\mathrm{freq}}(x)}\sum_{i=1}^n -\,\left\{\!\!\begin{array}{rl} y_i & \textrm{if $x_i = x$}\\ 0 & \textrm{otherwise}\end{array}\right.\!\!\! -\end{equation*} -and $\bar{y} = (1{/}n)\sum_{i=1}^n y_i$ is the mean. Value $\hat{y}[x]$ is the average -of~$y_i$ among all records where $x_i = x$; it can also be viewed as the ``predictor'' -of $y$ given~$x$. Then $\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2$ is the residual error -sum-of-squares and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum-of-squares for~$y$. -Hence, $\eta^2$ measures the accuracy of predicting $y$ with~$x$, just like the -``R-squared'' statistic measures the accuracy of linear regression. Our output $\eta$ -is the square root of~$\eta^2$. -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it\NameStatF]: -A measure of how much the values of the scale feature, denoted here by~$y$, -deviate from statistical independence on the nominal feature, denoted by~$x$. -The same measure appears in the one-way analysis of vari\-ance (ANOVA). -Like \NameStatChi, \NameStatF{} is used to test the hypothesis that -$y$~is independent from~$x$, given the following assumptions: -\begin{Itemize} -\item The scale feature $y$ has approximately normal distribution whose mean -may depend only on~$x$ and variance is the same for all~$x$; -\item The nominal feature $x$ has relatively small value domain with large -frequency counts, the $x_i$-values are treated as fixed (non-random); -\item All records are sampled independently of each other. -\end{Itemize} -To compute \NameStatF{}, we first compute $\hat{y}[x]$ as the average of~$y_i$ -among all records where $x_i = x$. These $\hat{y}[x]$ can be viewed as -``predictors'' of $y$ given~$x$; if $y$ is independent on~$x$, they should -``predict'' only the global mean~$\bar{y}$. Then we form two sums-of-squares: -\begin{Itemize} -\item \emph{Residual} sum-of-squares of the ``predictor'' accuracy: $y_i - \hat{y}[x_i]$; -\item \emph{Explained} sum-of-squares of the ``predictor'' variability: $\hat{y}[x_i] - \bar{y}$. 
-\end{Itemize} -\NameStatF{} is the ratio of the explained sum-of-squares to -the residual sum-of-squares, each divided by their corresponding degrees -of freedom: -\begin{equation*} -F \,\,=\,\, -\frac{\sum_{x}\, \mathop{\mathrm{freq}}(x) \, \big(\hat{y}[x] - \bar{y}\big)^2 \,\big/\,\, (k\,{-}\,1)}% -{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2 \,\big/\,\, (n\,{-}\,k)} \,\,=\,\, -\frac{n\,{-}\,k}{k\,{-}\,1} \cdot \frac{\eta^2}{1 - \eta^2} -\end{equation*} -Here $k$ is the domain size of the nominal feature~$x$. The $k$ ``predictors'' -lose 1~freedom due to their linear dependence with~$\bar{y}$; similarly, -the $n$~$y_i$-s lose $k$~freedoms due to the ``predictors''. - -The statistic can test if the independence hypothesis of $y$ from $x$ is reasonable; -more generally (with relaxed normality assumptions) it can test the hypothesis that -\emph{the mean} of $y$ among records with a given~$x$ is the same for all~$x$. -Under this hypothesis \NameStatF{} has, or approximates, the $F(k\,{-}\,1, n\,{-}\,k)$-distribution. -But if the mean of $y$ given $x$ depends on~$x$, \NameStatF{} -becomes arbitrarily large as $n\to\infty$ (with $k$~fixed) and lands extremely far -into the tail of the $F(k\,{-}\,1, n\,{-}\,k)$-distribution given a large enough data sample. -\end{Description} - - -\paragraph{Ordinal-vs-ordinal statistics.} -Sample statistics that describe association between two ordinal categorical features. -Both features' value domains are encoded with positive integers, so that the natural -order of the integers coincides with the order in each value domain. -\begin{Description} -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it\NameStatRho]: -A measure for the strength of association (statistical dependence) between -two ordinal features, conceptually similar to \NameStatR. Specifically, it is \NameStatR{} -applied to the feature vectors in which all values are replaced by their ranks, i.e.\ -their positions if the vector is sorted. The ranks of identical (duplicate) values -are replaced with their average rank. For example, in vector -$(15, 11, 26, 15, 8)$ the value ``15'' occurs twice with ranks 3 and~4 per the sorted -order $(8_1, 11_2, 15_3, 15_4, 26_5)$; so, both values are assigned their average -rank of $3.5 = (3\,{+}\,4)\,{/}\,2$ and the vector is replaced by~$(3.5,\, 2,\, 5,\, 3.5,\, 1)$. - -Our implementation of \NameStatRho{} is geared towards features having small value domains -and large counts for the values. Given the two input vectors, we form a contingency table $T$ -of pairwise frequency counts, as well as a vector of frequency counts for each feature: $f_1$ -and~$f_2$. Here in $T_{i,j}$, $f_{1,i}$, $f_{2,j}$ indices $i$ and~$j$ refer to the -order-preserving integer encoding of the feature values. -We use prefix sums over $f_1$ and~$f_2$ to compute the values' average ranks: -$r_{1,i} = \sum_{j=1}^{i-1} f_{1,j} + (f_{1,i}\,{+}\,1){/}2$, and analogously for~$r_2$. -Finally, we compute rank variances for $r_1, r_2$ weighted by counts $f_1, f_2$ and their -covariance weighted by~$T$, before applying the standard formula for \NameStatR: -\begin{equation*} -\rho \,\,=\,\, \frac{\Cov_T(r_1, r_2)}{\sqrt{\Var_{f_1}(r_1)\Var_{f_2}(r_2)}} -\,\,=\,\, \frac{\sum_{i,j} T_{i,j} (r_{1,i} - \bar{r}_1) (r_{2,j} - \bar{r}_2)}% -{\sqrt{\sum_i f_{1,i} (r_{1,i} - \bar{r}_1)^{2\mathstrut} \cdot \sum_j f_{2,j} (r_{2,j} - \bar{r}_2)^{2\mathstrut}}} -\end{equation*} -where $\bar{r}_1 = \sum_i r_{1,i} f_{1,i}{/}n$, analogously for~$\bar{r}_2$. 
-The value of $\rho$ lies between $-1$ and $+1$, with sign indicating the prevalent direction -of the association: $\rho > 0$ ($\rho < 0$) means that one feature tends to increase (decrease) -when the other feature increases. The correlation becomes~1 when the two features are -monotonically related. -\end{Description} - - -\smallskip -\noindent{\bf Returns} -\smallskip - -A collection of (potentially) 4 matrices. Each matrix contains bivariate statistics that -resulted from a different combination of feature types. There is one matrix for scale-scale -statistics (which includes \NameStatR), one for nominal-nominal statistics (includes \NameStatChi{}), -one for nominal-scale statistics (includes \NameStatF) and one for ordinal-ordinal statistics -(includes \NameStatRho). If any of these matrices is not produced, then no pair of columns required -the corresponding type combination. See Table~\ref{table:bivars} for the matrix naming and -format details. - - -\smallskip -\pagebreak[2] - -\noindent{\bf Examples} -\smallskip - -{\hangindent=\parindent\noindent\tt -\hml -f \BivarScriptName{} -nvargs -X=/user/biadmin/X.mtx -index1=/user/biadmin/S1.mtx -index2=/user/biadmin/S2.mtx -types1=/user/biadmin/K1.mtx -types2=/user/biadmin/K2.mtx -OUTDIR=/user/biadmin/stats.mtx - -} -
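\smallskip
For readers who want to sanity-check the formulas in this section outside of SystemML,
the sketch below recomputes each group of statistics on in-memory column pairs with
NumPy/SciPy. It only illustrates the definitions above; it is not the \BivarScriptName{}
implementation, and the helper names and the use of {\tt scipy.stats} are our own choices.

\begin{verbatim}
import numpy as np
from scipy import stats

def pearson_r(v1, v2):
    # r = Cov(v1, v2) / sqrt(Var(v1) * Var(v2))
    d1, d2 = v1 - v1.mean(), v2 - v2.mean()
    return (d1 * d2).sum() / np.sqrt((d1 * d1).sum() * (d2 * d2).sum())

def chi2_dof_pval_cramers_v(a, b):
    # Observed counts O and expected counts E = (row sums x col sums) / n
    ca, ia = np.unique(a, return_inverse=True)
    cb, ib = np.unique(b, return_inverse=True)
    O = np.zeros((ca.size, cb.size))
    np.add.at(O, (ia, ib), 1.0)
    n = O.sum()
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n
    chi2 = ((O - E) ** 2 / E).sum()                    # Pearson's chi^2
    dof = (ca.size - 1) * (cb.size - 1)                # degrees of freedom
    pval = stats.chi2.sf(chi2, dof)                    # P-value of Pearson's chi^2
    v = np.sqrt(chi2 / (n * min(ca.size - 1, cb.size - 1)))  # Cramer's V
    return chi2, dof, pval, v

def eta_and_f(x_nominal, y_scale):
    # Eta statistic and F statistic for a nominal-vs-scale pair
    groups = [y_scale[x_nominal == c] for c in np.unique(x_nominal)]
    rss = sum(((g - g.mean()) ** 2).sum() for g in groups)  # residual sum-of-squares
    tss = ((y_scale - y_scale.mean()) ** 2).sum()           # total sum-of-squares
    k, n = len(groups), y_scale.size
    eta = np.sqrt(1.0 - rss / tss)
    F = ((tss - rss) / (k - 1)) / (rss / (n - k))
    return eta, F

def spearman_rho(v1, v2):
    # Spearman's rho = Pearson's r on average ranks (ties share their mean rank)
    return pearson_r(stats.rankdata(v1), stats.rankdata(v2))
\end{verbatim}

Each helper should agree, up to numerical rounding, with the corresponding rows of the
output matrices in Table~\ref{table:bivars} for the given pair of columns.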
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4e9699e4/docs/Algorithms Reference/DescriptiveStats.tex ---------------------------------------------------------------------- diff --git a/docs/Algorithms Reference/DescriptiveStats.tex b/docs/Algorithms Reference/DescriptiveStats.tex deleted file mode 100644 index 5a59ad4..0000000 --- a/docs/Algorithms Reference/DescriptiveStats.tex +++ /dev/null @@ -1,115 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\newcommand{\UnivarScriptName}{\texttt{\tt Univar-Stats.dml}} -\newcommand{\BivarScriptName}{\texttt{\tt bivar-stats.dml}} - -\newcommand{\OutputRowIDMinimum}{1} -\newcommand{\OutputRowIDMaximum}{2} -\newcommand{\OutputRowIDRange}{3} -\newcommand{\OutputRowIDMean}{4} -\newcommand{\OutputRowIDVariance}{5} -\newcommand{\OutputRowIDStDeviation}{6} -\newcommand{\OutputRowIDStErrorMean}{7} -\newcommand{\OutputRowIDCoeffVar}{8} -\newcommand{\OutputRowIDQuartiles}{?, 13, ?} -\newcommand{\OutputRowIDMedian}{13} -\newcommand{\OutputRowIDIQMean}{14} -\newcommand{\OutputRowIDSkewness}{9} -\newcommand{\OutputRowIDKurtosis}{10} -\newcommand{\OutputRowIDStErrorSkewness}{11} -\newcommand{\OutputRowIDStErrorCurtosis}{12} -\newcommand{\OutputRowIDNumCategories}{15} -\newcommand{\OutputRowIDMode}{16} -\newcommand{\OutputRowIDNumModes}{17} -\newcommand{\OutputRowText}[1]{\mbox{(output row~{#1})\hspace{0.5pt}:}} - -\newcommand{\NameStatR}{Pearson's correlation coefficient} -\newcommand{\NameStatChi}{Pearson's~$\chi^2$} -\newcommand{\NameStatPChi}{$P\textrm{-}$value of Pearson's~$\chi^2$} -\newcommand{\NameStatV}{Cram\'er's~$V$} -\newcommand{\NameStatEta}{Eta statistic} -\newcommand{\NameStatF}{$F$~statistic} -\newcommand{\NameStatRho}{Spearman's rank correlation coefficient} - -Descriptive statistics are used to quantitatively describe the main characteristics of the data. -They provide meaningful summaries computed over different observations or data records -collected in a study. These summaries typically form the basis of the initial data exploration -as part of a more extensive statistical analysis. Such a quantitative analysis assumes that -every variable (also known as, attribute, feature, or column) in the data has a specific -\emph{level of measurement}~\cite{Stevens1946:scales}. - -The measurement level of a variable, often called as {\bf variable type}, can either be -\emph{scale} or \emph{categorical}. A \emph{scale} variable represents the data measured on -an interval scale or ratio scale. Examples of scale variables include `Height', `Weight', -`Salary', and `Temperature'. Scale variables are also referred to as \emph{quantitative} -or \emph{continuous} variables. In contrast, a \emph{categorical} variable has a fixed -limited number of distinct values or categories. 
Examples of categorical variables -include `Gender', `Region', `Hair color', `Zipcode', and `Level of Satisfaction'. -Categorical variables can further be classified into two types, \emph{nominal} and -\emph{ordinal}, depending on whether the categories in the variable can be ordered via an -intrinsic ranking. For example, there is no meaningful ranking among distinct values in -`Hair color' variable, while the categories in `Level of Satisfaction' can be ranked from -highly dissatisfied to highly satisfied. - -The input dataset for descriptive statistics is provided in the form of a matrix, whose -rows are the records (data points) and whose columns are the features (i.e.~variables). -Some scripts allow this matrix to be vertically split into two or three matrices. Descriptive -statistics are computed over the specified features (columns) in the matrix. Which -statistics are computed depends on the types of the features. It is important to keep -in mind the following caveats and restrictions: -\begin{Enumerate} -\item Given a finite set of data records, i.e.~a \emph{sample}, we take their feature -values and compute their \emph{sample statistics}. These statistics -will vary from sample to sample even if the underlying distribution of feature values -remains the same. Sample statistics are accurate for the given sample only. -If the goal is to estimate the \emph{distribution statistics} that are parameters of -the (hypothesized) underlying distribution of the features, the corresponding sample -statistics may sometimes be used as approximations, but their accuracy will vary. -\item In particular, the accuracy of the estimated distribution statistics will be low -if the number of values in the sample is small. That is, for small samples, the computed -statistics may depend on the randomness of the individual sample values more than on -the underlying distribution of the features. -\item The accuracy will also be low if the sample records cannot be assumed mutually -independent and identically distributed (i.i.d.), that is, sampled at random from the -same underlying distribution. In practice, feature values in one record often depend -on other features and other records, including unknown ones. -\item Most of the computed statistics will have low estimation accuracy in the presence of -extreme values (outliers) or if the underlying distribution has heavy tails, for example -obeys a power law. However, a few of the computed statistics, such as the median and -\NameStatRho{}, are \emph{robust} to outliers. -\item Some sample statistics are reported with their \emph{sample standard errors} -in an attempt to quantify their accuracy as distribution parameter estimators. But these -sample standard errors, in turn, only estimate the underlying distribution's standard -errors and will have low accuracy for small or \mbox{non-i.i.d.} samples, outliers in samples, -or heavy-tailed distributions. -\item We assume that the quantitative (scale) feature columns do not contain missing -values, infinite values, \texttt{NaN}s, or coded non-numeric values, unless otherwise -specified. We assume that each categorical feature column contains positive integers -from 1 to the number of categories; for ordinal features, the natural order on -the integers should coincide with the order on the categories. 
-\end{Enumerate} - -\input{DescriptiveUnivarStats} - -\input{DescriptiveBivarStats} - -\input{DescriptiveStratStats} http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4e9699e4/docs/Algorithms Reference/DescriptiveStratStats.tex ---------------------------------------------------------------------- diff --git a/docs/Algorithms Reference/DescriptiveStratStats.tex b/docs/Algorithms Reference/DescriptiveStratStats.tex deleted file mode 100644 index be0cffd..0000000 --- a/docs/Algorithms Reference/DescriptiveStratStats.tex +++ /dev/null @@ -1,306 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Stratified Bivariate Statistics} - -\noindent{\bf Description} -\smallskip - -The {\tt stratstats.dml} script computes common bivariate statistics, such -as correlation, slope, and their p-value, in parallel for many pairs of input -variables in the presence of a confounding categorical variable. The values -of this confounding variable group the records into strata (subpopulations), -in which all bivariate pairs are assumed free of confounding. The script -uses the same data model as in one-way analysis of covariance (ANCOVA), with -strata representing population samples. It also outputs univariate stratified -and bivariate unstratified statistics. - -\begin{table}[t]\hfil -\begin{tabular}{|l|ll|ll|ll||ll|} -\hline -Month of the year & \multicolumn{2}{l|}{October} & \multicolumn{2}{l|}{November} & - \multicolumn{2}{l||}{December} & \multicolumn{2}{c|}{Oct$\,$--$\,$Dec} \\ -Customers, millions & 0.6 & 1.4 & 1.4 & 0.6 & 3.0 & 1.0 & 5.0 & 3.0 \\ -\hline -Promotion (0 or 1) & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ -Avg.\ sales per 1000 & 0.4 & 0.5 & 0.9 & 1.0 & 2.5 & 2.6 & 1.8 & 1.3 \\ -\hline -\end{tabular}\hfil -\caption{Stratification example: the effect of the promotion on average sales -becomes reversed and amplified (from $+0.1$ to $-0.5$) if we ignore the months.} -\label{table:stratexample} -\end{table} - -To see how data stratification mitigates confounding, consider an (artificial) -example in Table~\ref{table:stratexample}. A highly seasonal retail item -was marketed with and without a promotion over the final 3~months of the year. -In each month the sale was more likely with the promotion than without it. -But during the peak holiday season, when shoppers came in greater numbers and -bought the item more often, the promotion was less frequently used. As a result, -if the 4-th quarter data is pooled together, the promotion's effect becomes -reversed and magnified. Stratifying by month restores the positive correlation. - -The script computes its statistics in parallel over all possible pairs from two -specified sets of covariates. 
The 1-st covariate is a column in input matrix~$X$ -and the 2-nd covariate is a column in input matrix~$Y$; matrices $X$ and~$Y$ may -be the same or different. The columns of interest are given by their index numbers -in special matrices. The stratum column, specified in its own matrix, is the same -for all covariate pairs. - -Both covariates in each pair must be numerical, with the 2-nd covariate normally -distributed given the 1-st covariate (see~Details). Missing covariate values or -strata are represented by~``NaN''. Records with NaN's are selectively omitted -wherever their NaN's are material to the output statistic. - -\smallskip -\pagebreak[3] - -\noindent{\bf Usage} -\smallskip - -{\hangindent=\parindent\noindent\it% -{\tt{}-f }path/\/{\tt{}stratstats.dml} -{\tt{} -nvargs} -{\tt{} X=}path/file -{\tt{} Xcid=}path/file -{\tt{} Y=}path/file -{\tt{} Ycid=}path/file -{\tt{} S=}path/file -{\tt{} Scid=}int -{\tt{} O=}path/file -{\tt{} fmt=}format - -} - - -\smallskip -\noindent{\bf Arguments} -\begin{Description} -\item[{\tt X}:] -Location (on HDFS) to read matrix $X$ whose columns we want to use as -the 1-st covariate (i.e.~as the feature variable) -\item[{\tt Xcid}:] (default:\mbox{ }{\tt " "}) -Location to read the single-row matrix that lists all index numbers -of the $X$-columns used as the 1-st covariate; the default value means -``use all $X$-columns'' -\item[{\tt Y}:] (default:\mbox{ }{\tt " "}) -Location to read matrix $Y$ whose columns we want to use as the 2-nd -covariate (i.e.~as the response variable); the default value means -``use $X$ in place of~$Y$'' -\item[{\tt Ycid}:] (default:\mbox{ }{\tt " "}) -Location to read the single-row matrix that lists all index numbers -of the $Y$-columns used as the 2-nd covariate; the default value means -``use all $Y$-columns'' -\item[{\tt S}:] (default:\mbox{ }{\tt " "}) -Location to read matrix $S$ that has the stratum column. -Note: the stratum column must contain small positive integers; all fractional -values are rounded; stratum IDs of value ${\leq}\,0$ or NaN are treated as -missing. The default value for {\tt S} means ``use $X$ in place of~$S$'' -\item[{\tt Scid}:] (default:\mbox{ }{\tt 1}) -The index number of the stratum column in~$S$ -\item[{\tt O}:] -Location to store the output matrix defined in Table~\ref{table:stratoutput} -\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) -Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; -see read/write functions in SystemML Language Reference for details. -\end{Description} - - -\begin{table}[t]\small\hfil -\begin{tabular}{|rcl|rcl|} -\hline -& Col.\# & Meaning & & Col.\# & Meaning \\ -\hline -\multirow{9}{*}{\begin{sideways}1-st covariate\end{sideways}}\hspace{-1em} -& 01 & $X$-column number & -\multirow{9}{*}{\begin{sideways}2-nd covariate\end{sideways}}\hspace{-1em} -& 11 & $Y$-column number \\ -& 02 & presence count for $x$ & -& 12 & presence count for $y$ \\ -& 03 & global mean $(x)$ & -& 13 & global mean $(y)$ \\ -& 04 & global std.\ dev. $(x)$ & -& 14 & global std.\ dev. $(y)$ \\ -& 05 & stratified std.\ dev. $(x)$ & -& 15 & stratified std.\ dev. 
$(y)$ \\ -& 06 & $R^2$ for $x \sim {}$strata & -& 16 & $R^2$ for $y \sim {}$strata \\ -& 07 & adjusted $R^2$ for $x \sim {}$strata & -& 17 & adjusted $R^2$ for $y \sim {}$strata \\ -& 08 & p-value, $x \sim {}$strata & -& 18 & p-value, $y \sim {}$strata \\ -& 09--10 & reserved & -& 19--20 & reserved \\ -\hline -\multirow{9}{*}{\begin{sideways}$y\sim x$, NO strata\end{sideways}}\hspace{-1.15em} -& 21 & presence count $(x, y)$ & -\multirow{10}{*}{\begin{sideways}$y\sim x$ AND strata$\!\!\!\!$\end{sideways}}\hspace{-1.15em} -& 31 & presence count $(x, y, s)$ \\ -& 22 & regression slope & -& 32 & regression slope \\ -& 23 & regres.\ slope std.\ dev. & -& 33 & regres.\ slope std.\ dev. \\ -& 24 & correlation${} = \pm\sqrt{R^2}$ & -& 34 & correlation${} = \pm\sqrt{R^2}$ \\ -& 25 & residual std.\ dev. & -& 35 & residual std.\ dev. \\ -& 26 & $R^2$ in $y$ due to $x$ & -& 36 & $R^2$ in $y$ due to $x$ \\ -& 27 & adjusted $R^2$ in $y$ due to $x$ & -& 37 & adjusted $R^2$ in $y$ due to $x$ \\ -& 28 & p-value for ``slope = 0'' & -& 38 & p-value for ``slope = 0'' \\ -& 29 & reserved & -& 39 & \# strata with ${\geq}\,2$ count \\ -& 30 & reserved & -& 40 & reserved \\ -\hline -\end{tabular}\hfil -\caption{The {\tt stratstats.dml} output matrix has one row per each distinct -pair of 1-st and 2-nd covariates, and 40 columns with the statistics described -here.} -\label{table:stratoutput} -\end{table} - - - - -\noindent{\bf Details} -\smallskip - -Suppose we have $n$ records of format $(i, x, y)$, where $i\in\{1,\ldots, k\}$ is -a stratum number and $(x, y)$ are two numerical covariates. We want to analyze -conditional linear relationship between $y$ and $x$ conditioned by~$i$. -Note that $x$, but not~$y$, may represent a categorical variable if we assign a -numerical value to each category, for example 0 and 1 for two categories. - -We assume a linear regression model for~$y$: -\begin{equation} -y_{i,j} \,=\, \alpha_i + \beta x_{i,j} + \eps_{i,j}\,, \quad\textrm{where}\,\,\,\, -\eps_{i,j} \sim \Normal(0, \sigma^2) -\label{eqn:stratlinmodel} -\end{equation} -Here $i = 1\ldots k$ is a stratum number and $j = 1\ldots n_i$ is a record number -in stratum~$i$; by $n_i$ we denote the number of records available in stratum~$i$. -The noise term~$\eps_{i,j}$ is assumed to have the same variance in all strata. 
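The quantities derived in the remainder of this section ($V_x$, $V_y$, $V_{x,y}$, the slope
$\hat{\beta}$, the stratified correlation $\hat{R}$, the residual standard deviation
$\hat{\sigma}$, and the $t$-test for ``slope${}={}0$'') can be sketched in a few lines of
NumPy/SciPy. The snippet below is only an illustration of the derivation that follows, not
the {\tt stratstats.dml} code; it assumes complete (NaN-free) inputs and reports a two-sided
$p$-value.

\begin{verbatim}
import numpy as np
from scipy import stats

def stratified_stats(x, y, s):
    # x, y: covariate columns; s: small positive integer stratum IDs
    Vx = Vy = Vxy = 0.0
    strata = np.unique(s)
    for i in strata:
        xi, yi = x[s == i], y[s == i]                 # records of stratum i
        dx, dy = xi - xi.mean(), yi - yi.mean()       # deviations from stratum means
        Vx += (dx * dx).sum()
        Vy += (dy * dy).sum()
        Vxy += (dx * dy).sum()
    n, k = x.size, strata.size
    beta = Vxy / Vx                                   # stratified regression slope
    R = Vxy / np.sqrt(Vx * Vy)                        # stratified correlation
    rss = Vy * (1.0 - R * R)                          # residual sum-of-squares
    sigma = np.sqrt(rss / (n - k - 1))                # residual std. deviation
    t = R * np.sqrt((n - k - 1) / (1.0 - R * R))      # t-statistic for "slope = 0"
    pval = 2.0 * stats.t.sf(abs(t), n - k - 1)        # two-sided p-value (our choice)
    return beta, R, sigma, t, pval
\end{verbatim}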
-When $n_i\,{>}\,0$, we can estimate the means of $x_{i, j}$ and $y_{i, j}$ in -stratum~$i$ as -\begin{equation*} -\bar{x}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,x_{i, j}\Big) / n_i\,;\quad -\bar{y}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,y_{i, j}\Big) / n_i -\end{equation*} -If $\beta$ is known, the best estimate for $\alpha_i$ is $\bar{y}_i - \beta \bar{x}_i$, -which gives the prediction error sum-of-squares of -\begin{equation} -\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \beta x_{i,j} - (\bar{y}_i - \beta \bar{x}_i)\big)^2 -\,\,=\,\, \beta^{2\,}V_x \,-\, 2\beta \,V_{x,y} \,+\, V_y -\label{eqn:stratsumsq} -\end{equation} -where $V_x$, $V_y$, and $V_{x, y}$ are, correspondingly, the ``stratified'' sample -estimates of variance $\Var(x)$ and $\Var(y)$ and covariance $\Cov(x,y)$ computed as -\begin{align*} -V_x \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)^2; \quad -V_y \,=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \bar{y}_i\big)^2;\\ -V_{x,y} \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)\big(y_{i,j} - \bar{y}_i\big) -\end{align*} -They are stratified because we compute the sample (co-)variances in each stratum~$i$ -separately, then combine by summation. The stratified estimates for $\Var(X)$ and $\Var(Y)$ -tend to be smaller than the non-stratified ones (with the global mean instead of $\bar{x}_i$ -and~$\bar{y}_i$) since $\bar{x}_i$ and $\bar{y}_i$ fit closer to $x_{i,j}$ and $y_{i,j}$ -than the global means. The stratified variance estimates the uncertainty in $x_{i,j}$ -and~$y_{i,j}$ given their stratum~$i$. - -Minimizing over~$\beta$ the error sum-of-squares~(\ref{eqn:stratsumsq}) -gives us the regression slope estimate \mbox{$\hat{\beta} = V_{x,y} / V_x$}, -with~(\ref{eqn:stratsumsq}) becoming the residual sum-of-squares~(RSS): -\begin{equation*} -\mathrm{RSS} \,\,=\, \, -\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - -\hat{\beta} x_{i,j} - (\bar{y}_i - \hat{\beta} \bar{x}_i)\big)^2 -\,\,=\,\, V_y \,\big(1 \,-\, V_{x,y}^2 / (V_x V_y)\big) -\end{equation*} -The quantity $\hat{R}^2 = V_{x,y}^2 / (V_x V_y)$, called \emph{$R$-squared}, estimates the fraction -of stratified variance in~$y_{i,j}$ explained by covariate $x_{i, j}$ in the linear -regression model~(\ref{eqn:stratlinmodel}). We define \emph{stratified correlation} as the -square root of~$\hat{R}^2$ taken with the sign of~$V_{x,y}$. We also use RSS to estimate -the residual standard deviation $\sigma$ in~(\ref{eqn:stratlinmodel}) that models the prediction error -of $y_{i,j}$ given $x_{i,j}$ and the stratum: -\begin{equation*} -\hat{\beta}\, =\, \frac{V_{x,y}}{V_x}; \,\,\,\, \hat{R} \,=\, \frac{V_{x,y}}{\sqrt{V_x V_y}}; -\,\,\,\, \hat{R}^2 \,=\, \frac{V_{x,y}^2}{V_x V_y}; -\,\,\,\, \hat{\sigma} \,=\, \sqrt{\frac{\mathrm{RSS}}{n - k - 1}}\,\,\,\, -\Big(n = \sum_{i=1}^k n_i\Big) -\end{equation*} - -The $t$-test and the $F$-test for the null-hypothesis of ``$\beta = 0$'' are -obtained by considering the effect of $\hat{\beta}$ on the residual sum-of-squares, -measured by the decrease from $V_y$ to~RSS. -The $F$-statistic is the ratio of the ``explained'' sum-of-squares -to the residual sum-of-squares, divided by their corresponding degrees of freedom. 
-There are $n\,{-}\,k$ degrees of freedom for~$V_y$, parameter $\beta$ reduces that -to $n\,{-}\,k\,{-}\,1$ for~RSS, and their difference $V_y - {}$RSS has just 1 degree -of freedom: -\begin{equation*} -F \,=\, \frac{(V_y - \mathrm{RSS})/1}{\mathrm{RSS}/(n\,{-}\,k\,{-}\,1)} -\,=\, \frac{\hat{R}^2\,(n\,{-}\,k\,{-}\,1)}{1-\hat{R}^2}; \quad -t \,=\, \hat{R}\, \sqrt{\frac{n\,{-}\,k\,{-}\,1}{1-\hat{R}^2}}. -\end{equation*} -The $t$-statistic is simply the square root of the $F$-statistic with the appropriate -choice of sign. If the null hypothesis and the linear model are both true, the $t$-statistic -has Student $t$-distribution with $n\,{-}\,k\,{-}\,1$ degrees of freedom. We can -also compute it if we divide $\hat{\beta}$ by its estimated standard deviation: -\begin{equation*} -\stdev(\hat{\beta})_{\mathrm{est}} \,=\, \hat{\sigma}\,/\sqrt{V_x} \quad\Longrightarrow\quad -t \,=\, \hat{R}\sqrt{V_y} \,/\, \hat{\sigma} \,=\, \beta \,/\, \stdev(\hat{\beta})_{\mathrm{est}} -\end{equation*} -The standard deviation estimate for~$\beta$ is included in {\tt stratstats.dml} output. - -\smallskip -\noindent{\bf Returns} -\smallskip - -The output matrix format is defined in Table~\ref{table:stratoutput}. - -\smallskip -\noindent{\bf Examples} -\smallskip - -{\hangindent=\parindent\noindent\tt -\hml -f stratstats.dml -nvargs X=/user/biadmin/X.mtx Xcid=/user/biadmin/Xcid.mtx - Y=/user/biadmin/Y.mtx Ycid=/user/biadmin/Ycid.mtx S=/user/biadmin/S.mtx Scid=2 - O=/user/biadmin/Out.mtx fmt=csv - -} -{\hangindent=\parindent\noindent\tt -\hml -f stratstats.dml -nvargs X=/user/biadmin/Data.mtx Xcid=/user/biadmin/Xcid.mtx - Ycid=/user/biadmin/Ycid.mtx Scid=7 O=/user/biadmin/Out.mtx - -} - -%\smallskip -%\noindent{\bf See Also} -%\smallskip -% -%For non-stratified bivariate statistics with a wider variety of input data types -%and statistical tests, see \ldots. For general linear regression, see -%{\tt LinearRegDS.dml} and {\tt LinearRegCG.dml}. For logistic regression, appropriate -%when the response variable is categorical, see {\tt MultiLogReg.dml}. - http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4e9699e4/docs/Algorithms Reference/DescriptiveUnivarStats.tex ---------------------------------------------------------------------- diff --git a/docs/Algorithms Reference/DescriptiveUnivarStats.tex b/docs/Algorithms Reference/DescriptiveUnivarStats.tex deleted file mode 100644 index 5838e3e..0000000 --- a/docs/Algorithms Reference/DescriptiveUnivarStats.tex +++ /dev/null @@ -1,603 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Univariate Statistics} - -\noindent{\bf Description} -\smallskip - -\emph{Univariate statistics} are the simplest form of descriptive statistics in data -analysis. 
They are used to quantitatively describe the main characteristics of each -feature in the data. For a given dataset matrix, script \UnivarScriptName{} computes -certain univariate statistics for each feature column in the -matrix. The feature type governs the exact set of statistics computed for that feature. -For example, the statistic \emph{mean} can only be computed on a quantitative (scale) -feature like `Height' and `Temperature'. It does not make sense to compute the mean -of a categorical attribute like `Hair Color'. - - -\smallskip -\noindent{\bf Usage} -\smallskip - -{\hangindent=\parindent\noindent\it%\tolerance=0 -{\tt{}-f } \UnivarScriptName{} -{\tt{} -nvargs} -{\tt{} X=}path/file -{\tt{} TYPES=}path/file -{\tt{} STATS=}path/file -% {\tt{} fmt=}format - -} - - -\medskip -\pagebreak[2] -\noindent{\bf Arguments} -\begin{Description} -\item[{\tt X}:] -Location (on HDFS) to read the data matrix $X$ whose columns we want to -analyze as the features. -\item[{\tt TYPES}:] % (default:\mbox{ }{\tt " "}) -Location (on HDFS) to read the single-row matrix whose $i^{\textrm{th}}$ -column-cell contains the type of the $i^{\textrm{th}}$ feature column -\texttt{X[,$\,i$]} in the data matrix. Feature types must be encoded by -integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. -% The default value means ``treat all $X$-columns as scale.'' -\item[{\tt STATS}:] -Location (on HDFS) where the output matrix of computed statistics -will be stored. The format of the output matrix is defined by -Table~\ref{table:univars}. -% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) -% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; -% see read/write functions in SystemML Language Reference for details. -\end{Description} - -\begin{table}[t]\hfil -\begin{tabular}{|rl|c|c|} -\hline -\multirow{2}{*}{Row}& \multirow{2}{*}{Name of Statistic} & \multicolumn{2}{c|}{Applies to:} \\ - & & Scale & Categ.\\ -\hline -\OutputRowIDMinimum & Minimum & + & \\ -\OutputRowIDMaximum & Maximum & + & \\ -\OutputRowIDRange & Range & + & \\ -\OutputRowIDMean & Mean & + & \\ -\OutputRowIDVariance & Variance & + & \\ -\OutputRowIDStDeviation & Standard deviation & + & \\ -\OutputRowIDStErrorMean & Standard error of mean & + & \\ -\OutputRowIDCoeffVar & Coefficient of variation & + & \\ -\OutputRowIDSkewness & Skewness & + & \\ -\OutputRowIDKurtosis & Kurtosis & + & \\ -\OutputRowIDStErrorSkewness & Standard error of skewness & + & \\ -\OutputRowIDStErrorCurtosis & Standard error of kurtosis & + & \\ -\OutputRowIDMedian & Median & + & \\ -\OutputRowIDIQMean & Inter quartile mean & + & \\ -\OutputRowIDNumCategories & Number of categories & & + \\ -\OutputRowIDMode & Mode & & + \\ -\OutputRowIDNumModes & Number of modes & & + \\ -\hline -\end{tabular}\hfil -\caption{The output matrix of \UnivarScriptName{} has one row per each -univariate statistic and one column per input feature. This table lists -the meaning of each row. Signs ``+'' show applicability to scale or/and -to categorical features.} -\label{table:univars} -\end{table} - - -\pagebreak[1] - -\smallskip -\noindent{\bf Details} -\smallskip - -Given an input matrix \texttt{X}, this script computes the set of all -relevant univariate statistics for each feature column \texttt{X[,$\,i$]} -in~\texttt{X}. The list of statistics to be computed depends on the -\emph{type}, or \emph{measurement level}, of each column. -The \textrm{TYPES} command-line argument points to a vector containing -the types of all columns. 
The types must be provided as per the following -convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. - -Below we list all univariate statistics computed by script \UnivarScriptName. -The statistics are collected by relevance into several groups, namely: central -tendency, dispersion, shape, and categorical measures. The first three groups -contain statistics computed for a quantitative (also known as: numerical, scale, -or continuous) feature; the last group contains the statistics for a categorical -(either nominal or ordinal) feature. - -Let~$n$ be the number of data records (rows) with feature values. -In what follows we fix a column index \texttt{idx} and consider -sample statistics of feature column \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}. -Let $v = (v_1, v_2, \ldots, v_n)$ be the values of \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]} -in their original unsorted order: $v_i = \texttt{X[}i\texttt{,}\,\texttt{idx]}$. -Let $v^s = (v^s_1, v^s_2, \ldots, v^s_n)$ be the same values in the sorted order, -preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$. - -\paragraph{Central tendency measures.} -Sample statistics that describe the location of the quantitative (scale) feature distribution, -represent it with a single value. -\begin{Description} -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it Mean] -\OutputRowText{\OutputRowIDMean} -The arithmetic average over a sample of a quantitative feature. -Computed as the ratio between the sum of values and the number of values: -$\left(\sum_{i=1}^n v_i\right)\!/n$. -Example: the mean of sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ -equals~5.2. - -Note that the mean is significantly affected by extreme values in the sample -and may be misleading as a central tendency measure if the feature varies on -exponential scale. For example, the mean of $\{$0.01, 0.1, 1.0, 10.0, 100.0$\}$ -is 22.222, greater than all the sample values except the~largest. 
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -\begin{figure}[t] -\setlength{\unitlength}{10pt} -\begin{picture}(33,12) -\put( 6.2, 0.0){\small 2.2} -\put(10.2, 0.0){\small 3.2} -\put(12.2, 0.0){\small 3.7} -\put(15.0, 0.0){\small 4.4} -\put(18.6, 0.0){\small 5.3} -\put(20.2, 0.0){\small 5.7} -\put(21.75,0.0){\small 6.1} -\put(23.05,0.0){\small 6.4} -\put(26.2, 0.0){\small 7.2} -\put(28.6, 0.0){\small 7.8} -\put( 0.5, 0.7){\small 0.0} -\put( 0.1, 3.2){\small 0.25} -\put( 0.5, 5.7){\small 0.5} -\put( 0.1, 8.2){\small 0.75} -\put( 0.5,10.7){\small 1.0} -\linethickness{1.5pt} -\put( 2.0, 1.0){\line(1,0){4.8}} -\put( 6.8, 1.0){\line(0,1){1.0}} -\put( 6.8, 2.0){\line(1,0){4.0}} -\put(10.8, 2.0){\line(0,1){1.0}} -\put(10.8, 3.0){\line(1,0){2.0}} -\put(12.8, 3.0){\line(0,1){1.0}} -\put(12.8, 4.0){\line(1,0){2.8}} -\put(15.6, 4.0){\line(0,1){1.0}} -\put(15.6, 5.0){\line(1,0){3.6}} -\put(19.2, 5.0){\line(0,1){1.0}} -\put(19.2, 6.0){\line(1,0){1.6}} -\put(20.8, 6.0){\line(0,1){1.0}} -\put(20.8, 7.0){\line(1,0){1.6}} -\put(22.4, 7.0){\line(0,1){1.0}} -\put(22.4, 8.0){\line(1,0){1.2}} -\put(23.6, 8.0){\line(0,1){1.0}} -\put(23.6, 9.0){\line(1,0){3.2}} -\put(26.8, 9.0){\line(0,1){1.0}} -\put(26.8,10.0){\line(1,0){2.4}} -\put(29.2,10.0){\line(0,1){1.0}} -\put(29.2,11.0){\line(1,0){4.8}} -\linethickness{0.3pt} -\put( 6.8, 1.0){\circle*{0.3}} -\put(10.8, 1.0){\circle*{0.3}} -\put(12.8, 1.0){\circle*{0.3}} -\put(15.6, 1.0){\circle*{0.3}} -\put(19.2, 1.0){\circle*{0.3}} -\put(20.8, 1.0){\circle*{0.3}} -\put(22.4, 1.0){\circle*{0.3}} -\put(23.6, 1.0){\circle*{0.3}} -\put(26.8, 1.0){\circle*{0.3}} -\put(29.2, 1.0){\circle*{0.3}} -\put( 6.8, 1.0){\vector(1,0){27.2}} -\put( 2.0, 1.0){\vector(0,1){10.8}} -\put( 2.0, 3.5){\line(1,0){10.8}} -\put( 2.0, 6.0){\line(1,0){17.2}} -\put( 2.0, 8.5){\line(1,0){21.6}} -\put( 2.0,11.0){\line(1,0){27.2}} -\put(12.8, 1.0){\line(0,1){2.0}} -\put(19.2, 1.0){\line(0,1){5.0}} -\put(20.0, 1.0){\line(0,1){5.0}} -\put(23.6, 1.0){\line(0,1){7.0}} -\put( 9.0, 4.0){\line(1,0){3.8}} -\put( 9.2, 2.7){\vector(0,1){0.8}} -\put( 9.2, 4.8){\vector(0,-1){0.8}} -\put(19.4, 8.0){\line(1,0){3.0}} -\put(19.6, 7.2){\vector(0,1){0.8}} -\put(19.6, 9.3){\vector(0,-1){0.8}} -\put(13.0, 2.2){\small $q_{25\%}$} -\put(17.3, 2.2){\small $q_{50\%}$} -\put(23.8, 2.2){\small $q_{75\%}$} -\put(20.15,3.5){\small $\mu$} -\put( 8.0, 3.75){\small $\phi_1$} -\put(18.35,7.8){\small $\phi_2$} -\end{picture} -\label{fig:example_quartiles} -\caption{The computation of quartiles, median, and interquartile mean from the -empirical distribution function of the 10-point -sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$. Each vertical step in -the graph has height~$1{/}n = 0.1$. Values $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$ denote -the $1^{\textrm{st}}$, $2^{\textrm{nd}}$, and $3^{\textrm{rd}}$ quartiles correspondingly; -value~$\mu$ denotes the median. Values $\phi_1$ and $\phi_2$ show the partial contribution -of border points (quartiles) $v_3=3.7$ and $v_8=6.4$ into the interquartile mean.} -\end{figure} - -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it Median] -\OutputRowText{\OutputRowIDMedian} -The ``middle'' value that separates the higher half of the sample values -(in a sorted order) from the lower half. -To compute the median, we sort the sample in the increasing order, preserving -duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$. 
-If $n$ is odd, the median equals $v^s_i$ where $i = (n\,{+}\,1)\,{/}\,2$, -same as the $50^{\textrm{th}}$~percentile of the sample. -If $n$ is even, there are two ``middle'' values $v^s_{n/2}$ and $v^s_{n/2\,+\,1}$, -so we compute the median as the mean of these two values. -(For even~$n$ we compute the $50^{\textrm{th}}$~percentile as~$v^s_{n/2}$, -not as the median.) Example: the median of sample -$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ -equals $(5.3\,{+}\,5.7)\,{/}\,2$~${=}$~5.5, see Figure~\ref{fig:example_quartiles}. - -Unlike the mean, the median is not sensitive to extreme values in the sample, -i.e.\ it is robust to outliers. It works better as a measure of central tendency -for heavy-tailed distributions and features that vary on exponential scale. -However, the median is sensitive to small sample size. -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it Interquartile mean] -\OutputRowText{\OutputRowIDIQMean} -For a sample of a quantitative feature, this is -the mean of the values greater than or equal to the $1^{\textrm{st}}$ quartile -and less than or equal the $3^{\textrm{rd}}$ quartile. -In other words, it is a ``truncated mean'' where the lowest 25$\%$ and -the highest 25$\%$ of the sorted values are omitted in its computation. -The two ``border values'', i.e.\ the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$ -quartiles themselves, contribute to this mean only partially. -This measure is occasionally used as the ``robust'' version of the mean -that is less sensitive to the extreme values. - -To compute the measure, we sort the sample in the increasing order, -preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$. -We set $j = \lceil n{/}4 \rceil$ for the $1^{\textrm{st}}$ quartile index -and $k = \lceil 3n{/}4 \rceil$ for the $3^{\textrm{rd}}$ quartile index, -then compute the following weighted mean: -\begin{equation*} -\frac{1}{3{/}4 - 1{/}4} \left[ -\left(\frac{j}{n} - \frac{1}{4}\right) v^s_j \,\,+ -\sum_{j<i<k} \left(\frac{i}{n} - \frac{i\,{-}\,1}{n}\right) v^s_i -\,\,+\,\, \left(\frac{3}{4} - \frac{k\,{-}\,1}{n}\right) v^s_k\right] -\end{equation*} -In other words, all sample values between the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$ -quartile enter the sum with weights $2{/}n$, times their number of duplicates, while the -two quartiles themselves enter the sum with reduced weights. The weights are proportional -to the vertical steps in the empirical distribution function of the sample, see -Figure~\ref{fig:example_quartiles} for an illustration. -Example: the interquartile mean of sample -$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals the sum -$0.1 (3.7\,{+}\,6.4) + 0.2 (4.4\,{+}\,5.3\,{+}\,5.7\,{+}\,6.1)$, -which equals~5.31. -\end{Description} - - -\paragraph{Dispersion measures.} -Statistics that describe the amount of variation or spread in a quantitative -(scale) data feature. -\begin{Description} -%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% -\item[\it Variance] -\OutputRowText{\OutputRowIDVariance} -A measure of dispersion, or spread-out, of sample values around their mean, -expressed in units that are the square of those of the feature itself. -Computed as the sum of squared differences between the values -in the sample and their mean, divided by one less than the number of -values: $\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$ where -$\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$. 
-\end{Description}
-
-
-\paragraph{Dispersion measures.}
-Statistics that describe the amount of variation or spread in a quantitative
-(scale) data feature.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Variance]
-\OutputRowText{\OutputRowIDVariance}
-A measure of dispersion, or spread, of sample values around their mean,
-expressed in units that are the square of those of the feature itself.
-Computed as the sum of squared differences between the values
-in the sample and their mean, divided by one less than the number of
-values: $\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$ where
-$\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
-Example: the variance of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~3.24.
-Note that at least two values ($n\geq 2$) are required to avoid division
-by zero. Sample variance is sensitive to outliers, even more so than the mean.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Standard deviation]
-\OutputRowText{\OutputRowIDStDeviation}
-A measure of dispersion around the mean, equal to the square root of the variance.
-Computed by taking the square root of the sample variance;
-see \emph{Variance} above for how the variance is computed.
-Example: the standard deviation of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~1.8.
-At least two values are required to avoid division by zero.
-Note that the standard deviation is sensitive to outliers.
-
-Standard deviation is used in conjunction with the mean to determine
-an interval containing a given percentage of the feature values,
-assuming a normal distribution. In a large sample from a normal
-distribution, around 68\% of the cases fall within one standard
-deviation and around 95\% of the cases fall within two standard deviations
-of the mean. For example, if the mean age is 45 with a standard deviation
-of 10, around 95\% of the cases would be between 25 and 65 in a normal
-distribution.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Coefficient of variation]
-\OutputRowText{\OutputRowIDCoeffVar}
-The ratio of the standard deviation to the mean, i.e.\ the
-\emph{relative} standard deviation, of a quantitative feature sample.
-Computed by dividing the sample \emph{standard deviation} by the
-sample \emph{mean}; see above for their computation details.
-Example: the coefficient of variation for sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals 1.8$\,{/}\,$5.2~${\approx}$~0.346.
-
-This metric is used primarily with non-negative features such as
-financial or population data. It is sensitive to outliers.
-Note: a zero mean causes division by zero, returning infinity or \texttt{NaN}.
-At least two values (records) are required to compute the standard deviation.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Minimum]
-\OutputRowText{\OutputRowIDMinimum}
-The smallest value of a quantitative sample, computed as $\min v = v^s_1$.
-Example: the minimum of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals~2.2.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Maximum]
-\OutputRowText{\OutputRowIDMaximum}
-The largest value of a quantitative sample, computed as $\max v = v^s_n$.
-Example: the maximum of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals~7.8.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Range]
-\OutputRowText{\OutputRowIDRange}
-The difference between the largest and the smallest value of a quantitative
-sample, computed as $\max v - \min v = v^s_n - v^s_1$.
-It provides information about the overall spread of the sample values.
-Example: the range of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals 7.8$\,{-}\,$2.2~${=}$~5.6.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Standard error of the mean]
-\OutputRowText{\OutputRowIDStErrorMean}
-A measure of how much the value of the sample mean may vary from sample
-to sample taken from the same (hypothesized) distribution of the feature.
-It helps to roughly bound the distribution mean, i.e.\
-the limit of the sample mean as the sample size tends to infinity.
-Under certain assumptions (e.g.\ normality and a large sample), the difference
-between the distribution mean and the sample mean is unlikely to exceed
-2~standard errors.
-
-The measure is computed by dividing the sample standard deviation
-by the square root of the number of values~$n$; see \emph{standard deviation}
-for its computation details. Ensure $n\,{\geq}\,2$ to avoid division by~0.
-Example: for sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-with the mean of~5.2, the standard error of the mean
-equals 1.8$\,{/}\sqrt{10}$~${\approx}$~0.569.
-
-Note that the standard error itself is subject to sample randomness.
-Its accuracy as an error estimator may be low if the sample size is small,
-if the sample is not \mbox{i.i.d.}, if there are outliers, or if the
-distribution has heavy tails.
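-
-For illustration only (this sketch is not part of \UnivarScriptName{}), the
-dispersion measures above can be reproduced in a few lines of Python; for the
-example sample the expected values are 3.24, 1.8, approximately 0.346, and
-approximately 0.569:
-\begin{verbatim}
-# Illustrative sketch: dispersion measures as defined above
-# (the sample variance uses n-1 in the denominator).
-from math import sqrt
-
-sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
-n = len(sample)                          # n >= 2 is assumed throughout
-mean = sum(sample) / n                   # 5.2
-
-variance = sum((v - mean) ** 2 for v in sample) / (n - 1)  # 3.24
-std_dev = sqrt(variance)                                   # 1.8
-coeff_of_variation = std_dev / mean      # ~0.346 (mean must be nonzero)
-std_error_of_mean = std_dev / sqrt(n)    # ~0.569
-
-print(variance, std_dev, coeff_of_variation, std_error_of_mean)
-print(min(sample), max(sample), max(sample) - min(sample))  # min, max, range
-\end{verbatim}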
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-% \item[\it Quartiles]
-% \OutputRowText{\OutputRowIDQuartiles}
-% %%% dsDefn %%%%
-% The values of a quantitative feature
-% that divide an ordered/sorted set of data records into four equal-size groups.
-% The $1^{\textrm{st}}$ quartile, or the $25^{\textrm{th}}$ percentile, splits
-% the sorted data into the lowest $25\%$ and the highest~$75\%$. In other words,
-% it is the middle value between the minimum and the median. The $2^{\textrm{nd}}$
-% quartile is the median itself, the value that separates the higher half of
-% the data (in the sorted order) from the lower half. Finally, the $3^{\textrm{rd}}$
-% quartile, or the $75^{\textrm{th}}$ percentile, divides the sorted data into
-% lowest $75\%$ and highest~$25\%$.\par
-% %%% dsComp %%%%
-% To compute the quartiles for a data column \texttt{X[,i]} with $n$ numerical values
-% we sort it in the increasing order, preserving duplicates, then return
-% \texttt{X}${}^{\textrm{sort}}$\texttt{[}$k$\texttt{,i]}
-% where $k = \lceil pn \rceil$ for $p = 0.25$, $0.5$, and~$0.75$.
-% When $n$ is even, the $2^{\textrm{nd}}$ quartile (the median) is further adjusted
-% to equal the mean of two middle values
-% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\texttt{,i]}$ and
-% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\,{+}\,1\texttt{,i]}$.
-% %%% dsWarn %%%%
-% We assume that the feature column does not contain \texttt{NaN}s or coded non-numeric values.
-% %%% dsExmpl %%%
-% \textbf{Example(s).}
-\end{Description}
-
-
-\paragraph{Shape measures.}
-Statistics that describe the shape and symmetry of the quantitative (scale)
-feature distribution estimated from a sample of its values.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Skewness]
-\OutputRowText{\OutputRowIDSkewness}
-It measures how symmetrically the values of a feature are spread out
-around the mean. A significant positive skewness implies a longer (or fatter)
-right tail, i.e.\ feature values tend to lie farther away from the mean on the
-right side. A significant negative skewness implies a longer (or fatter) left
-tail. The normal distribution is symmetric and has a skewness value of~0;
-however, its sample skewness is likely to be nonzero, just close to zero.
-As a guideline, a skewness value more than twice its standard error is taken
-to indicate a departure from symmetry.
-
-Skewness is computed as the $3^{\textrm{rd}}$~central moment divided by the cube
-of the standard deviation. We estimate the $3^{\textrm{rd}}$~central moment as
-the sum of cubed differences between the values in the feature column and their
-sample mean, divided by the number of values:
-$\sum_{i=1}^n (v_i - \bar{v})^3 / n$
-where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
-The standard deviation is computed
-as described above in \emph{standard deviation}. To avoid division by~0,
-at least two different sample values are required. Example: for sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-with the mean of~5.2 and the standard deviation of~1.8,
-the skewness is estimated as $-1.0728\,{/}\,1.8^3 \approx -0.184$.
-Note: skewness is sensitive to outliers.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Standard error in skewness]
-\OutputRowText{\OutputRowIDStErrorSkewness}
-A measure of how much the sample skewness may vary from sample to sample,
-assuming that the feature is normally distributed, which makes its
-distribution skewness equal~0.
-Given the number~$n$ of sample values, the standard error is computed as
-\begin{equation*}
-\sqrt{\frac{6n\,(n-1)}{(n-2)(n+1)(n+3)}}
-\end{equation*}
-This measure can tell us, for example:
-\begin{Itemize}
-\item If the sample skewness lands within two standard errors of~0, its
-positive or negative sign is not significant and may be accidental.
-\item If the sample skewness lands outside this interval, the feature
-is unlikely to be normally distributed.
-\end{Itemize}
-At least 3~values ($n\geq 3$) are required to avoid arithmetic failure.
-Note that the standard error is inaccurate if the feature distribution is
-far from normal or if the sample size is small.
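-
-To make the two formulas above concrete, here is a purely illustrative Python
-sketch (not part of \UnivarScriptName{}) that reproduces the sample skewness
-and its standard error for the example sample:
-\begin{verbatim}
-# Illustrative sketch: sample skewness as defined above (3rd central
-# moment over n, divided by the cube of the sample standard deviation
-# that uses n-1), plus its standard error under the normality assumption.
-from math import sqrt
-
-sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
-n = len(sample)                                # n >= 3 for the standard error
-mean = sum(sample) / n
-std_dev = sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
-
-m3 = sum((v - mean) ** 3 for v in sample) / n  # 3rd central moment: -1.0728
-skewness = m3 / std_dev ** 3                   # approximately -0.184
-
-se_skewness = sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
-
-print(skewness, se_skewness)
-\end{verbatim}
-For this sample the skewness lies well within two standard errors of zero, so,
-by the guideline above, its negative sign is not significant.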
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Kurtosis]
-\OutputRowText{\OutputRowIDKurtosis}
-As a distribution parameter, kurtosis is a measure of the extent to which
-feature values cluster around a central point. In other words, it quantifies
-the ``peakedness'' of the distribution: how tall and sharp the central peak is
-relative to a standard bell curve.
-
-Positive kurtosis (\emph{leptokurtic} distribution) indicates that, relative
-to a normal distribution:
-\begin{Itemize}
-\item observations cluster more about the center (peak-shaped),
-\item the tails are thinner at non-extreme values,
-\item the tails are thicker at extreme values.
-\end{Itemize}
-Negative kurtosis (\emph{platykurtic} distribution) indicates that, relative
-to a normal distribution:
-\begin{Itemize}
-\item observations cluster less about the center (box-shaped),
-\item the tails are thicker at non-extreme values,
-\item the tails are thinner at extreme values.
-\end{Itemize}
-The kurtosis of a normal distribution is zero; however, the sample kurtosis
-(computed here) is likely to deviate from zero.
-
-Sample kurtosis is computed as the $4^{\textrm{th}}$~central moment divided
-by the $4^{\textrm{th}}$~power of the standard deviation, minus~3
-(i.e.\ the \emph{excess} kurtosis).
-We estimate the $4^{\textrm{th}}$~central moment as the sum of the
-$4^{\textrm{th}}$~powers of differences between the values in the feature column
-and their sample mean, divided by the number of values:
-$\sum_{i=1}^n (v_i - \bar{v})^4 / n$
-where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
-The standard deviation is computed as described above; see \emph{standard deviation}.
-
-Note that kurtosis is sensitive to outliers, and requires at least two different
-sample values. Example: for sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-with the mean of~5.2 and the standard deviation of~1.8,
-the sample kurtosis equals $16.6962\,{/}\,1.8^4 - 3 \approx -1.41$.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Standard error in kurtosis]
-\OutputRowText{\OutputRowIDStErrorCurtosis}
-A measure of how much the sample kurtosis may vary from sample to sample,
-assuming that the feature is normally distributed, which makes its
-distribution kurtosis equal~0.
-Given the number~$n$ of sample values, the standard error is computed as
-\begin{equation*}
-\sqrt{\frac{24n\,(n-1)^2}{(n-3)(n-2)(n+3)(n+5)}}
-\end{equation*}
-This measure can tell us, for example:
-\begin{Itemize}
-\item If the sample kurtosis lands within two standard errors of~0, its
-positive or negative sign is not significant and may be accidental.
-\item If the sample kurtosis lands outside this interval, the feature
-is unlikely to be normally distributed.
-\end{Itemize}
-At least 4~values ($n\geq 4$) are required to avoid arithmetic failure.
-Note that the standard error is inaccurate if the feature distribution is
-far from normal or if the sample size is small.
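-
-Analogously to the skewness sketch above, the sample kurtosis and its standard
-error can be illustrated with a few lines of Python (again, not part of
-\UnivarScriptName{}):
-\begin{verbatim}
-# Illustrative sketch: sample (excess) kurtosis as defined above (4th
-# central moment over n, divided by the 4th power of the sample standard
-# deviation, minus 3), plus its standard error under normality.
-from math import sqrt
-
-sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
-n = len(sample)                                # n >= 4 for the standard error
-mean = sum(sample) / n
-std_dev = sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
-
-m4 = sum((v - mean) ** 4 for v in sample) / n  # 4th central moment: 16.6962
-kurtosis = m4 / std_dev ** 4 - 3               # approximately -1.41
-
-se_kurtosis = sqrt(24.0 * n * (n - 1) ** 2
-                   / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
-
-print(kurtosis, se_kurtosis)
-\end{verbatim}
-Here, too, the sample kurtosis lies within two standard errors of zero.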
-\end{Description}
-
-
-\paragraph{Categorical measures.} Statistics that describe the sample of
-a categorical feature, either nominal or ordinal. We represent all
-categories by integers from~1 to the number of categories; we call
-these integers \emph{category~IDs}.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Number of categories]
-\OutputRowText{\OutputRowIDNumCategories}
-The maximum category~ID that occurs in the sample. Note that some
-categories with~IDs \emph{smaller} than this maximum~ID may have
-no~occurrences in the sample, without reducing the number of categories.
-However, categories with~IDs \emph{larger} than the maximum~ID that have
-no occurrences in the sample are not counted.
-Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
-the number of categories is reported as~8. Category~IDs 2 and~6, which have
-zero occurrences, are still counted; but if there is a category with
-ID${}=9$ and zero occurrences, it is not counted.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Mode]
-\OutputRowText{\OutputRowIDMode}
-The most frequently occurring category value.
-If several values share the greatest frequency of occurrence, then each
-of them is a mode; but here we report only the smallest of these modes.
-Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
-the modes are 3 and~7, with 3 reported.
-
-Computed by counting the number of occurrences for each category,
-then taking the smallest category~ID that has the maximum count.
-Note that the sample modes may be different from the distribution modes,
-i.e.\ the categories whose (hypothesized) underlying probability is the
-maximum over all categories.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Number of modes]
-\OutputRowText{\OutputRowIDNumModes}
-The number of category values that share the largest frequency
-count in the sample.
-Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
-there are two category~IDs (3 and~7) that each occur the maximum number
-of 4~times; hence, we return~2.
-
-Computed by counting the number of occurrences for each category,
-then counting how many categories have the maximum count.
-
-Note that the sample modes may be different from the distribution modes,
-i.e.\ the categories whose (hypothesized) underlying probability is the
-maximum over all categories.
-A short illustrative sketch of these three categorical measures is given
-after the invocation example below.
-\end{Description}
-
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-The output matrix containing all computed statistics has $17$~rows and
-as many columns as the input matrix~\texttt{X}. Each row corresponds to
-a particular statistic, according to the convention specified in
-Table~\ref{table:univars}. The first $14$~statistics are applicable to
-\emph{scale} columns, and the last $3$~statistics are applicable to categorical,
-i.e.\ nominal and ordinal, columns.
-
-
-\pagebreak[2]
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-\hml -f \UnivarScriptName{} -nvargs X=/user/biadmin/X.mtx
- TYPES=/user/biadmin/types.mtx
- STATS=/user/biadmin/stats.mtx
-
-}
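-
-\smallskip
-As a final illustration (not part of \UnivarScriptName{}), the three
-categorical measures described above can be reproduced for the example
-sample of category~IDs as follows; the expected results are 8, 3, and~2:
-\begin{verbatim}
-# Illustrative sketch: number of categories, mode, and number of modes
-# for a sample of integer category IDs, as defined above.
-from collections import Counter
-
-sample = [1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8]
-
-num_categories = max(sample)      # 8: the largest category ID that occurs
-counts = Counter(sample)          # occurrences per category ID
-max_count = max(counts.values())  # 4
-modes = [c for c, cnt in counts.items() if cnt == max_count]
-mode = min(modes)                 # 3: the smallest of the modes 3 and 7
-num_modes = len(modes)            # 2
-
-print(num_categories, mode, num_modes)
-\end{verbatim}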