Sun, 8 Mar 2009 at 09:40 +0100, Jaroslav Hajek wrote:
> 1. "use all" -> "all" etc - I think this is more Octavish

Agreed.

> 2. covariances of zero-length vectors are returned as NA. covariances
> of length 1 vectors are zero.

Makes sense.

> 3. vectorizing the "pairs" case was really tricky (due to NaN/Inf/NA
> issues), but I think I got there in the end. I welcome testing.

I tried the following:

  ## Create data
  data = rand (10, 2);
  na_data = data;
  na_data (6, 1) = na_data (7, 2) = NA;

  ## Compute covariances
  c1 = cov (na_data, "complete");
  c2 = cov (na_data, "pairs");

I get

c1 =

   0.062607   0.042061
   0.042061   0.081121

which seems right, but

c2 =

   NaN   NaN
   NaN   NaN

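which doesn't seem right: only two entries are missing, so every pair of
columns still has at least eight complete rows to work with.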
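
For comparison, here is a minimal sketch of what a pairwise-complete
covariance might be expected to compute. It is not the patched cov: the
helper name pairwise_cov is hypothetical, it treats only NA (via isna) as
missing, and it assumes normalization by m - 1, where m is the number of
rows present in both columns.

```octave
## Hypothetical helper (not the patched cov): pairwise-complete covariance.
function c = pairwise_cov (x)
  n = columns (x);
  c = NA (n, n);                  # pairs with fewer than 2 usable rows stay NA
  for i = 1:n
    for j = i:n
      ## Keep only the rows where both columns i and j are present.
      ok = ! (isna (x(:,i)) | isna (x(:,j)));
      m = sum (ok);
      if (m > 1)
        xi = x(ok,i) - mean (x(ok,i));
        xj = x(ok,j) - mean (x(ok,j));
        v = (xi' * xj) / (m - 1); # assumed N-1 normalization
        c(i,j) = v;
        c(j,i) = v;
      endif
    endfor
  endfor
endfunction
```

On the na_data above, each diagonal entry would be based on 9 rows and the
off-diagonal entry on 8 rows, so none of the results should be NaN.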

> PS. this shows that for "cov", the penalty incurred by NA handling is
> nontrivial, especially for "pairs". Further, it is not clear which one
> of "complete" or "pairs" should be the default.

I actually think "all" should be the default, as this is the compatible
behaviour. This is also what R does, so statisticians should be happy.

[a couple of minutes later]

On modern processors NaN (and hence NA) handling is really slow. So,
just to get an idea of how this influences performance I did

  octave:20> data = rand (10000, 20);
  octave:21> na_data = data; na_data (6, 1) = na_data (7, 2) = NA;
  octave:22> tic, cov (data); toc
  Elapsed time is 0.0366599 seconds.
  octave:23> tic, cov (na_data); toc
  Elapsed time is 0.216626 seconds.
  octave:24> tic, cov (na_data, "complete"); toc
  Elapsed time is 0.055954 seconds.

So, removing NAs actually speeds up the computation, while providing a
more sensible result. Of course, when no NAs are present, you still pay
the cost of checking for them. Hmm, now I'm not sure about the default
behaviour...
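
One possible reading of the timings: "complete" can be done as listwise
deletion followed by the ordinary NA-free cov, so the only extra work is
building a row mask. A minimal sketch under that assumption (complete_cov
is a hypothetical name, and the patch may well do it differently):

```octave
## Hypothetical helper: drop every row that contains an NA, then call cov.
function c = complete_cov (x)
  ok = ! any (isna (x), 2);   # true for rows with no NA in any column
  c = cov (x(ok,:));          # ordinary covariance on the remaining rows
endfunction
```

The mask costs one pass over the data, which is what you would still pay
when no NAs are present.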

>  I think this and
> Matlab/R compatibility sums up to just not care about missing values
> by default.  For consistency, we should probably do the same for mean,
> std etc.
> 
> Opinions?

I think the most important point of this thread is that it seems
reasonable and feasible to skip NAs in statistical functions. So, I guess
it makes sense to raise this on the maintainers list to get a feel for
the general opinion.

Søren


_______________________________________________
Octave-dev mailing list
Octave-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/octave-dev
