On Sun, Mar 8, 2009 at 10:06 AM, Søren Hauberg <so...@hauberg.org> wrote:
> Sun, 08 03 2009 at 09:40 +0100, Jaroslav Hajek wrote:
>> 1. "use all" -> "all" etc - I think this is more Octavish
>
> Agreed.
>
>> 2. covariances of zero-length vectors are returned as NA. covariances
>> of length 1 vectors are zero.
>
> Makes sense.
>
>> 3. vectorizing the "pairs" case was really tricky (due to NaN/Inf/NA
>> issues), but I think I got there in the end. I welcome testing.
>
> I tried the following:
>
>  ## Create data
>  data = rand (10, 2);
>  na_data = data;
>  na_data (6, 1) = na_data (7, 2) = NA;
>
>  ## Compute covariances
>  c1 = cov (na_data, "complete");
>  c2 = cov (na_data, "pairs");
>
> I get
>
> c1 =
>
>   0.062607   0.042061
>   0.042061   0.081121
>
> which seems right, but
>
> c2 =
>
>   NaN   NaN
>   NaN   NaN
>
> which doesn't really seem right.
>

OK, I see the problem. I've attached an update.


>> PS. this shows that for "cov", the penalty incurred by NA handling is
>> nontrivial, especially for "pairs". Further, it is not clear which one
>> of "complete" or "pairs" should be the default.
>
> I actually think "all" should be default as this is the compatible
> behaviour. This is also what R does, so statisticians should be happy.
>

Agreed.

> [a couple of minutes later]
>
> On modern processors NaN (and hence NA) handling is really slow. So,
> just to get an idea of how this influences performance I did
>
>  octave:20> data = rand (10000, 20);
>  octave:21> na_data = data; na_data (6, 1) = na_data (7, 2) = NA;
>  octave:22> tic, cov (data); toc
>  Elapsed time is 0.0366599 seconds.
>  octave:23> tic, cov (na_data); toc
>  Elapsed time is 0.216626 seconds.
>  octave:24> tic, cov (na_data, "complete"); toc
>  Elapsed time is 0.055954 seconds.
>
> So, removing NAs actually speeds up the computation, while providing a
> more sensible result. Of course, when no NAs are present, the cost of
> checking for them remains. Hmm, now I'm not sure about the default
> behaviour...
>

Interesting, but maybe it can't be generalized? I get:

octave:1> data = rand (10000, 20);
octave:2> na_data = data; na_data (6, 1) = na_data (7, 2) = NA;
octave:3> tic, cov (data); toc
Elapsed time is 0.00816202 seconds.
octave:4> tic, cov (na_data); toc
Elapsed time is 0.00428605 seconds.
octave:5> tic, cov (na_data, "complete"); toc
Elapsed time is 0.00770807 seconds.

Do I have a less modern processor?


>>  I think this and
>> Matlab/R compatibility add up to simply not caring about missing values
>> by default.  For consistency, we should probably do the same for mean,
>> std etc.
>>
>> Opinions?
>
> I think the most important point of this thread is that it seems
> reasonable/possible to skip NAs in statistical functions. So, I guess
> it makes sense to discuss this on the maintainers list to get a
> feel for the general opinion.
>
> Søren
>

Hey, I thought we were on the list :(
Agreed with that. Would you like to start the thread?


cheers

-- 
RNDr. Jaroslav Hajek
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz
## Copyright (C) 1995, 1996, 1997, 1998, 1999, 2000, 2002, 2004, 2005,
##               2006, 2007 Kurt Hornik
## Copyright (C) 2009 Soren Hauberg, Jaroslav Hajek
##
## This file is part of Octave.
##
## Octave is free software; you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 3 of the License, or (at
## your option) any later version.
##
## Octave is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Octave; see the file COPYING.  If not, see
## <http://www.gnu.org/licenses/>.

## -*- texinfo -*-
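## @deftypefn {Function File} {} cov (@var{x})
## @deftypefnx {Function File} {} cov (@var{x}, @var{y})
## @deftypefnx {Function File} {} cov (@dots{}, @var{method})
## Compute covariance.
##
## If each row of @var{x} and @var{y} is an observation and each column is
## a variable, the (@var{i}, @var{j})-th entry of
## @code{cov (@var{x}, @var{y})} is the covariance between the @var{i}-th
## variable in @var{x} and the @var{j}-th variable in @var{y}.
## @iftex
## @tex
## $$
## \sigma_{ij} = {1 \over N-1} \sum_{k=1}^N (x_{ki} - \bar{x}_i)(y_{kj} - \bar{y}_j)
## $$
## where $\bar{x}_i$ and $\bar{y}_j$ are the mean values of the $i$-th
## column of $x$ and the $j$-th column of $y$.
## @end tex
## @end iftex
## If called with one argument, compute @code{cov (@var{x}, @var{x})}.
##
## The optional string @var{method} selects how missing values (NA) are
## handled:
##
## @table @asis
## @item "all"
## Use all observations, so missing values propagate into the result.
## This is the default.
## @item "complete"
## Remove every observation (row) that contains a missing value.
## @item "pairs"
## For each pair of variables, use the observations in which both
## variables are present.
## @end table
##
## Covariances of zero-length vectors are returned as NA; covariances of
## length 1 vectors are zero.
## @end deftypefn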

function c = cov (x, y, method = "all")

  if (nargin < 1 || nargin > 3)
    print_usage ();
  endif
  
  if (nargin == 1)
    two_inputs = false;
  elseif (nargin == 2 && ischar (y))
    method = y;
    two_inputs = false;
  else
    two_inputs = true;
  endif
  
  if (! ischar (method))
    error ("cov: method must be a string");
  endif
  
  if (rows (x) == 1)
    x = x.';
  endif
  n = rows (x);
  if (two_inputs)
    if (rows (y) == 1)
      y = y.';
    endif
    if (rows (y) != n)
      error ("cov: x and y must have the same number of observations");
    endif
  endif

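  if (n == 0)
    ## covariances of zero-length vectors are NA; return right away so
    ## that the method switch below does not overwrite the result.
    if (two_inputs)
      c = NA (columns (x), columns (y));
    else
      c = NA (columns (x), columns (x));
    endif
    return;
  endif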

  switch (lower (method))
    case "all"
      if (two_inputs)
        x = x - ones (n, 1) * sum (x) / n;
        y = y - ones (n, 1) * sum (y) / n;
        c = conj (x' * y) / max (1, n - 1);
      else
        x = x - ones (n, 1) * sum (x) / n;
        c = conj (x' * x) / max (1, n - 1);
      endif
    case "complete"
      ## we simply remove all incomplete rows.
      if (two_inputs)
        r = any (isna (x), 2) | any (isna (y), 2);
        x (r, :) = [];
        y (r, :) = [];
        c = cov (x, y);
      else
        r = any (isna (x), 2);
        x (r, :) = [];
        c = cov (x);
      endif
    case "pairs"
      ## this is the most complicated case.
      if (two_inputs)
        ## save NA masks.
        xnamsk = ! isna (x);
        ynamsk = ! isna (y);
        ## set everything non-finite to zero, to avoid Inf*0 and NaN*0
        ## products getting in our way.
        xmsk = isfinite (x);
        ymsk = isfinite (y);
        x(! xmsk) = 0;
        y(! ymsk) = 0;
        ## column means over the finite entries
        mx = sum (x) ./ sum (xmsk);
        my = sum (y) ./ sum (ymsk);
        ## subtract them
        x -= ones (n, 1) * mx;
        y -= ones (n, 1) * my;
        ## centering turned the masked entries into -mean; zero them
        ## again so that they do not leak into the products.
        x(! xmsk) = 0;
        y(! ymsk) = 0;
        ## calculate products
        c = conj (x' * y);
        ## count the finite observations shared by each pair of columns
        c1 = xmsk.' * ymsk;
        ## scale to get covariances
        c = c ./ max (c1 - 1, 1);
        ## count the non-NA observations shared by each pair of columns
        c2 = xnamsk.' * ynamsk;
        ## pairs that hit a NaN or Inf (non-NA, non-finite) become NaN
        c(c2 > c1) = NaN;
        ## set the zero-length covs to NA
        c(c2 == 0) = NA;
      else
        ## do the same for a single input.
        ## save NA masks.
        xnamsk = ! isna (x);
        ## set everything non-finite to zero, to avoid Inf*0 and NaN*0
        ## products getting in our way.
        xmsk = isfinite (x);
        x(! xmsk) = 0;
        ## column means over the finite entries
        mx = sum (x) ./ sum (xmsk);
        ## subtract them
        x -= ones (n, 1) * mx;
        ## centering turned the masked entries into -mean; zero them
        ## again so that they do not leak into the products.
        x(! xmsk) = 0;
        ## calculate products
        c = conj (x' * x);
        ## count the finite observations shared by each pair of columns
        c1 = xmsk.' * xmsk;
        ## scale to get covariances
        c = c ./ max (c1 - 1, 1);
        ## count the non-NA observations shared by each pair of columns
        c2 = xnamsk.' * xnamsk;
        ## pairs that hit a NaN or Inf (non-NA, non-finite) become NaN
        c(c2 > c1) = NaN;
        ## set the zero-length covs to NA
        c(c2 == 0) = NA;
      endif
    otherwise
      error ("cov: unknown method \"%s\"", method);
  endswitch

endfunction

%!test
%! x = rand (10);
%! cx1 = cov (x);
%! cx2 = cov (x, x);
%! assert (isequal (size (cx1), [10, 10]) && isequal (size (cx2), [10, 10]) && norm (cx1 - cx2) < 1e1*eps);

%!error cov ();

%!error cov (1, 2, 3);
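
## Additional checks for the NA-handling methods.  This is a minimal
## sketch and not part of the original patch; it assumes only the
## behaviour stated in the discussion of this patch: "complete" drops
## every row containing an NA, length 1 vectors have zero covariance,
## and "pairs" agrees with "all" when nothing is missing.  The variable
## name `keep' and the tolerances are illustrative choices.

%!test
%! ## "complete" should match plain cov on the rows without any NA
%! x = rand (10, 2);
%! x(6, 1) = x(7, 2) = NA;
%! keep = ! any (isna (x), 2);
%! assert (cov (x, "complete"), cov (x(keep, :)), 1e1*eps);

%!test
%! ## a single observation has zero covariance
%! assert (cov (5), 0);

%!test
%! ## without missing values, "pairs" should agree with the default
%! x = rand (10, 3);
%! assert (cov (x, "pairs"), cov (x), 1e2*eps);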

_______________________________________________
Octave-dev mailing list
Octave-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/octave-dev
