On Sun, Mar 8, 2009 at 10:06 AM, Søren Hauberg <so...@hauberg.org> wrote:
> Sun, 08 03 2009 at 09:40 +0100, Jaroslav Hajek wrote:
>> 1. "use all" -> "all" etc - I think this is more Octavish
>
> Agreed.
>
>> 2. covariances of zero-length vectors are returned as NA. covariances
>> of length 1 vectors are zero.
>
> Makes sense.
>
>> 3. vectorizing the "pairs" case was really tricky (due to NaN/Inf/NA
>> issues), but I think I got there in the end. I welcome testing.
>
> I tried the following:
>
> ## Create data
> data = rand (10, 2);
> na_data = data;
> na_data (6, 1) = na_data (7, 2) = NA;
>
> ## Compute covariances
> c1 = cov (na_data, "complete");
> c2 = cov (na_data, "pairs");
>
> I get
>
> c1 =
>
> 0.062607 0.042061
> 0.042061 0.081121
>
> which seems right, but
>
> c2 =
>
> NaN NaN
> NaN NaN
>
> which doesn't really seem right.
>
OK, I see the problem. Attached an update.
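
For anyone who wants to test the "pairs" case, a loop-based reference may be handy to compare against. The following is only a sketch of what "pairs" is meant to compute as read from the attached code: each column is centered by the mean of its own non-NA entries, and each entry is summed over the rows where both values are non-NA. The helper name pairs_cov_ref is hypothetical and not part of the attachment; real data is assumed, and the NaN-versus-NA distinction for Inf/NaN inputs is not reproduced.

```octave
1;  # mark this file as a script so the function below can be defined inline

## Hypothetical helper (not part of the attached cov.m): slow, loop-based
## reference for the "pairs" case. Only NA is treated as missing here.
function c = pairs_cov_ref (x)
  k = columns (x);
  c = zeros (k);
  ok = ! isna (x);
  for i = 1:k
    ## mean of column i over its own non-NA entries
    mi = mean (x(ok(:,i), i));
    for j = 1:k
      mj = mean (x(ok(:,j), j));
      ## rows where both values are available
      r = ok(:,i) & ok(:,j);
      m = sum (r);
      if (m == 0)
        ## no usable pair at all
        c(i,j) = NA;
      else
        c(i,j) = sum ((x(r,i) - mi) .* (x(r,j) - mj)) / max (m - 1, 1);
      endif
    endfor
  endfor
endfunction

## toy data, made up for illustration
x = [1 2; 2 NA; 3 6; NA 8; 5 10];
c = pairs_cov_ref (x)
```

The result of cov (x, "pairs") from the attachment can then be compared entry by entry with c on small inputs like this one.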
>> PS. this shows that for "cov", the penalty incurred by NA handling is
>> nontrivial, especially for "pairs". Further, it is not clear which one
>> of "complete" or "pairs" should be the default.
>
> I actually think "all" should be default as this is the compatible
> behaviour. This is also what R does, so statisticians should be happy.
>
Agreed.
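
To make the trade-off concrete, here is a minimal sketch with made-up toy data. It uses only the stock one-argument cov, so the complete-case variant is done by hand rather than through the new method argument.

```octave
## toy data, made up for illustration
x = [1 2; 2 NA; 3 6; 4 8; 5 11];

## "all": use every row as-is, so the missing value propagates
c_all = cov (x);

## complete-case by hand: drop every row that contains an NA first
r = any (isna (x), 2);
c_complete = cov (x(! r, :));

disp (c_all);
disp (c_complete);
```

With "all" as the default, the entries that involve the second column come back as missing, which at least makes the problem visible to the caller.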
> [a couple of minutes later]
>
> On modern processors NaN (and hence NA) handling is really slow. So,
> just to get an idea of how this influences performance I did
>
> octave:20> data = rand (10000, 20);
> octave:21> na_data = data; na_data (6, 1) = na_data (7, 2) = NA;
> octave:22> tic, cov (data); toc
> Elapsed time is 0.0366599 seconds.
> octave:23> tic, cov (na_data); toc
> Elapsed time is 0.216626 seconds.
> octave:24> tic, cov (na_data, "complete"); toc
> Elapsed time is 0.055954 seconds.
>
> So, removing NA's actually speeds up the computation, while providing a
> more sensible result. Of course, when no NA's are present, the cost of
> checking for them is still paid. Hmm, now I'm not sure about the default
> behaviour...
>
Interesting, but maybe it can't be generalized? I get:
octave:1> data = rand (10000, 20);
octave:2> na_data = data; na_data (6, 1) = na_data (7, 2) = NA;
octave:3> tic, cov (data); toc
Elapsed time is 0.00816202 seconds.
octave:4> tic, cov (na_data); toc
Elapsed time is 0.00428605 seconds.
octave:5> tic, cov (na_data, "complete"); toc
Elapsed time is 0.00770807 seconds.
Do I have a less modern processor?
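
A single tic/toc of a few milliseconds is quite noisy, so it may be worth repeating each call before drawing conclusions. A rough sketch, using only the stock one-argument cov; the repeat count nrep is an arbitrary choice for illustration:

```octave
data = rand (10000, 20);
na_data = data;
na_data(6, 1) = na_data(7, 2) = NA;

nrep = 50;  # arbitrary repeat count, chosen only for illustration
t_plain = t_na = zeros (nrep, 1);
for k = 1:nrep
  tic; cov (data);    t_plain(k) = toc;
  tic; cov (na_data); t_na(k)    = toc;
endfor

## medians are less sensitive to the occasional slow run than single timings
printf ("plain: median %g s, with NA: median %g s\n",
        median (t_plain), median (t_na));
```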
>> I think this and
>> Matlab/R compatibility add up to simply not caring about missing values
>> by default. For consistency, we should probably do the same for mean,
>> std etc.
>>
>> Opinions?
>
> I think the most important point of this thread is that it seems
> reasonable/possible to skip NA's in statistical functions. So, I guess
> it makes sense to discuss this on the maintainers list to get a
> feel for the general opinion.
>
> Søren
>
Hey I thought we were on the list :(
Agreed with that. Would you like to start the thread?
cheers
--
RNDr. Jaroslav Hajek
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz
## Copyright (C) 1995, 1996, 1997, 1998, 1999, 2000, 2002, 2004, 2005,
## 2006, 2007 Kurt Hornik
## Copyright (C) 2009 Soren Hauberg, Jaroslav Hajek
##
## This file is part of Octave.
##
## Octave is free software; you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 3 of the License, or (at
## your option) any later version.
##
## Octave is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Octave; see the file COPYING. If not, see
## <http://www.gnu.org/licenses/>.
## -*- texinfo -*-
## @deftypefn {Function File} {} cov (@var{x}, @var{y})
## Compute covariance.
##
## If each row of @var{x} and @var{y} is an observation and each column is
## a variable, the (@var{i}, @var{j})-th entry of
## @code{cov (@var{x}, @var{y})} is the covariance between the @var{i}-th
## variable in @var{x} and the @var{j}-th variable in @var{y}.
## @iftex
## @tex
## $$
## \sigma_{ij} = {1 \over N-1} \sum_{k=1}^N (x_{ki} - \bar{x}_i)(y_{kj} - \bar{y}_j)
## $$
## where $\bar{x}_i$ and $\bar{y}_j$ are the mean values of the $i$-th
## column of $x$ and the $j$-th column of $y$.
## @end tex
## @end iftex
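## If called with one argument, compute @code{cov (@var{x}, @var{x})}.
##
## The optional string argument @var{method} selects how missing values
## (NA) are handled: @code{"all"} (the default) uses every observation
## as-is, @code{"complete"} first removes every row that contains an NA,
## and @code{"pairs"} computes each entry from the observations available
## for that pair of variables.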
## @end deftypefn
function c = cov (x, y, method = "all")

  if (nargin < 1 || nargin > 3)
    print_usage ();
  endif

  ## cov (x, method) is allowed: a string in second position is the method.
  if (nargin == 1)
    two_inputs = false;
  elseif (nargin == 2 && ischar (y))
    method = y;
    two_inputs = false;
  else
    two_inputs = true;
  endif

  if (! ischar (method))
    error ("cov: method must be a string");
  endif

  ## treat row vectors as column vectors of observations.
  if (rows (x) == 1)
    x = x.';
  endif
  n = rows (x);

  if (two_inputs)
    if (rows (y) == 1)
      y = y.';
    endif
    if (rows (y) != n)
      error ("cov: x and y must have the same number of observations");
    endif
  endif

  ## covariances of zero-length vectors are NA; return right away so that
  ## the switch below does not overwrite the result.
  if (n == 0)
    if (two_inputs)
      c = NA (columns (x), columns (y));
    else
      c = NA (columns (x), columns (x));
    endif
    return;
  endif
  switch (lower (method))
    case "all"
      ## use every row as-is; missing values propagate into the result.
      if (two_inputs)
        x = x - ones (n, 1) * sum (x) / n;
        y = y - ones (n, 1) * sum (y) / n;
        c = conj (x' * y) / max (1, n - 1);
      else
        x = x - ones (n, 1) * sum (x) / n;
        c = conj (x' * x) / max (1, n - 1);
      endif
    case "complete"
      ## we simply remove all incomplete rows.
      if (two_inputs)
        r = any (isna (x), 2) | any (isna (y), 2);
        x(r, :) = [];
        y(r, :) = [];
        c = cov (x, y);
      else
        r = any (isna (x), 2);
        x(r, :) = [];
        c = cov (x);
      endif
    case "pairs"
      ## this is the most complicated case.
      if (two_inputs)
        ## save NA masks.
        xnamsk = ! isna (x);
        ynamsk = ! isna (y);
        ## set everything non-finite to zero, to avoid Inf*0 and NaN*0
        ## products getting in our way.
        xmsk = isfinite (x);
        ymsk = isfinite (y);
        x(! xmsk) = 0;
        y(! ymsk) = 0;
        ## column means over the finite entries
        mx = sum (x) ./ sum (xmsk);
        my = sum (y) ./ sum (ymsk);
        ## subtract them
        x -= ones (n, 1) * mx;
        y -= ones (n, 1) * my;
        ## the subtraction turned the zeroed entries into -mean; zero them
        ## again so that they contribute nothing to the products.
        x(! xmsk) = 0;
        y(! ymsk) = 0;
        ## calculate products
        c = conj (x' * y);
        ## count the pairs in which both values are finite
        c1 = xmsk.' * ymsk;
        ## scale to get covariances
        c = c ./ max (c1 - 1, 1);
        ## count the pairs in which both values are non-NA
        c2 = xnamsk.' * ynamsk;
        ## pairs that involved NaN or Inf (non-NA, but not finite) give NaN
        c(c2 > c1) = NaN;
        ## set the zero-length covs to NA
        c(c2 == 0) = NA;
      else
        ## do the same for a single input.
        ## save NA masks.
        xnamsk = ! isna (x);
        ## set everything non-finite to zero, to avoid Inf*0 and NaN*0
        ## products getting in our way.
        xmsk = isfinite (x);
        x(! xmsk) = 0;
        ## column means over the finite entries
        mx = sum (x) ./ sum (xmsk);
        ## subtract them
        x -= ones (n, 1) * mx;
        ## the subtraction turned the zeroed entries into -mean; zero them
        ## again so that they contribute nothing to the products.
        x(! xmsk) = 0;
        ## calculate products
        c = conj (x' * x);
        ## count the pairs in which both values are finite
        c1 = xmsk.' * xmsk;
        ## scale to get covariances
        c = c ./ max (c1 - 1, 1);
        ## count the pairs in which both values are non-NA
        c2 = xnamsk.' * xnamsk;
        ## pairs that involved NaN or Inf (non-NA, but not finite) give NaN
        c(c2 > c1) = NaN;
        ## set the zero-length covs to NA
        c(c2 == 0) = NA;
      endif
    otherwise
      error ("cov: unknown method '%s'", method);
  endswitch

endfunction
%!test
%! x = rand (10);
%! cx1 = cov (x);
%! cx2 = cov (x, x);
%! assert (isequal (size (cx1), [10, 10]));
%! assert (isequal (size (cx2), [10, 10]));
%! assert (norm (cx1 - cx2) < 1e1*eps);
%!error cov ();
%!error cov (1, 2, 3);
_______________________________________________
Octave-dev mailing list
Octave-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/octave-dev