On Wed, Nov 12, 2008 at 2:36 PM, Francesco Potortì <[EMAIL PROTECTED]> wrote:
> As I was writing on octave-help:
>
>>I am studying clustering and started from the pdist function in the
>>statistics package.
>>
>>I see that it has some problems.  For one, the seuclidean, mahalanobis,
>>minkowski, cosine, correlation and spearman distances give various
>>errors when tried on the vector [1 3; 2 3; 4 2; 4 5; 5 2].
>>
>>Second, while the euclidean, cityblock and chebychev distances appear to
>>work, the hamming and jaccard distances apparently give wrong results.
>>
>>But the most significant problem is that the method it uses, as far as I
>>can see, cannot work for seuclidean, mahalanobis, minkowski, cosine,
>>correlation, that is, for those where the distance of two vectors
>>depends on all the other vectors.
>
> I finished rewriting pdist from scratch, by looking at the documentation
> page published on the Mathworks site.  I had my implementation checked
> against Matlab (I do not have Matlab myself), and it in fact gives the
> same results for the test cases I tried.  I added tests to the bottom of
> the pdist file.
>
> So I think it is ready to be checked into octave-forge's archive.
>
> Unfortunately, it is a long time since I last updated my copy of
> octave-forge's repository, and my CVS access (Sourceforge user fpoto)
> does not work any more.  I see that you have switched to SVN.  Would
> anyone be so kind as to explain to me how to resync my old tree and upload
> the new pdist.m file?
>
> In the meantime, here is the file.
>
>
> ===File ~/math/octavelib/statistics/pdist.m=================
> ## Copyright (C) 2008  Francesco Potortì  <[EMAIL PROTECTED]>
> ##
> ## This program is free software; you can redistribute it and/or modify
> ## it under the terms of the GNU General Public License as published by
> ## the Free Software Foundation; either version 3, or (at your option)
> ## any later version.
> ##
> ## This program is distributed in the hope that it will be useful, but
> ## WITHOUT ANY WARRANTY; without even the implied warranty of
> ## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> ## General Public License for more details.
> ##
> ## You should have received a copy of the GNU General Public License
> ## along with this software; see the file COPYING.  If not, see
> ## <http://www.gnu.org/licenses/>.
>
> ## -*- texinfo -*-
> ## @deftypefn {Function File} @var{y} = pdist (@var{x})
> ## @deftypefnx {Function File} @var{y} = pdist (@var{x}, @var{distfun})
> ## @deftypefnx {Function File} @var{y} = pdist (@var{x}, @var{distfun}, @var{distfunarg}, @dots{})
> ##
> ## Return the distance between any two rows in @var{x}.
> ##
> ## @var{x} is the @var{q}x@var{d} matrix representing @var{q} row vectors
> ## of size @var{d}.
> ##
> ## The output is a dissimilarity matrix arranged into a row vector
> ## @var{y}, (@var{n} - 1) * @var{n} / 2 long, where the distances are in
> ## the order [(1, 2) (1, 3) @dots{} (2, 3) @dots{} (n-1, n)].  You can
> ## use the @code{squareform} function to display the distances between
> ## the vectors arranged into an @var{n}x@var{n} matrix.
> ##
> ## @var{distfun} is an optional argument specifying how the distance is
> ## computed.  It can be any of the names listed below, defaulting to
> ## "euclidean", or a user-defined function that takes two arguments
> ## @var{x} and @var{y} plus any number of optional arguments, where
> ## @var{x} is a row vector and @var{y} is a matrix having the same
> ## number of columns as @var{x}.  @var{distfun} returns a column vector
> ## where row @var{i} is the distance between @var{x} and row @var{i} of
> ## @var{y}.  Any additional arguments after @var{distfun} are passed on
> ## as distfun (@var{x}, @var{y}, @var{distfunarg1}, @var{distfunarg2},
> ## @dots{}).
> ##
> ## Predefined distance functions are:
> ##
> ## @table @samp
> ## @item "euclidean"
> ## Euclidean distance (default).
> ##
> ## @item "seuclidean"
> ## Standardized Euclidean distance. Each coordinate in the sum of
> ## squares is inversely weighted by the sample variance of that
> ## coordinate.
> ##
> ## @item "mahalanobis"
> ## Mahalanobis distance: see @code{mahalanobis}.
> ##
> ## @item "cityblock"
> ## City Block metric, aka Manhattan distance.
> ##
> ## @item "minkowski"
> ## Minkowski metric.  Accepts a numeric parameter @var{p}: for @var{p} = 1
> ## this is the same as the cityblock metric, and for @var{p} = 2 (the
> ## default) it is equal to the euclidean metric.
> ##
> ## @item "cosine"
> ## One minus the cosine of the included angle between rows, seen as
> ## vectors.
> ##
> ## @item "correlation"
> ## One minus the sample correlation between points (treated as
> ## sequences of values).
> ##
> ## @item "spearman"
> ## One minus the sample Spearman's rank correlation between
> ## observations, treated as sequences of values.
> ##
> ## @item "hamming"
> ## Hamming distance: the fraction of coordinates that differ.
> ##
> ## @item "jaccard"
> ## One minus the Jaccard coefficient, that is the fraction of nonzero
> ## coordinates that differ.
> ##
> ## @item "chebychev"
> ## Chebychev distance: the maximum coordinate difference.
> ## @end table
> ## @seealso{linkage,squareform}
> ## @end deftypefn
>
> ## Author: Francesco Potortì  <[EMAIL PROTECTED]>
>
> function y = pdist (x, distfun, varargin)
>
>  if (nargin < 1)
>    print_usage ();
>  elseif ((nargin > 1)
>          && ! ischar (distfun)
>          && ! isa (distfun, "function_handle"))
>    error (["pdist: the distance function must be either a string or a "
>            "function handle."]);
>  endif
>
>  if (nargin < 2)
>    distfun = "euclidean";
>  endif
>
>  if (! ismatrix (x) || isempty (x))
>    error ("pdist: x must be a nonempty matrix");
>  elseif (length (size (x)) > 2)
>    error ("pdist: x must be 1 or 2 dimensional");
>  endif
>
>  if (ischar (distfun))
>    order = nchoosek (1:rows (x), 2);
>    Xi = order(:,1);
>    Yi = order(:,2);
>    X = x';
>    switch (distfun)
>      case "euclidean"
>        diff = X(:,Xi) - X(:,Yi);
>        y = sqrt (sumsq (diff));
>        return
>      case "seuclidean"
>        diff = X(:,Xi) - X(:,Yi);
>        weights = inv (diag (var (X')));
>        y = sqrt (sum ((weights * diff) .* diff));
>        return
>      case "mahalanobis"
>        diff = X(:,Xi) - X(:,Yi);
>        weights = inv (cov (X'));
>        y = sqrt (sum ((weights * diff) .* diff));
>        return
>      case "cityblock"
>        diff = X(:,Xi) - X(:,Yi);
>        y = sum (abs (diff));
>        return
>      case "minkowski"
>        diff = X(:,Xi) - X(:,Yi);
>        if (nargin > 2)
>          p = varargin{1};
>          y = (sum ((abs (diff)).^p)).^(1/p);
>        else
>          y = sqrt (sumsq (diff)); # default p = 2
>        endif
>        return
>      case "cosine"
>        prod = X(:,Xi) .* X(:,Yi);
>        weights = sumsq (X(:,Xi)) .* sumsq (X(:,Yi));
>        y = 1 - sum (prod) ./ sqrt (weights);
>        return
>      case "correlation"
>        corr = cor (X);
>        y = 1 - corr (sub2ind (size (corr), Xi, Yi))';
>        return
>      case "spearman"
>        corr = spearman (X);
>        y = 1 - corr (sub2ind (size (corr), Xi, Yi))';
>        return
>      case "hamming"
>        diff = logical (X(:,Xi) - X(:,Yi));
>        y = sum (diff) / rows (X);
>        return
>      case "jaccard"
>        diff = logical (X(:,Xi) - X(:,Yi));
>        weights = X(:,Xi) | X(:,Yi);
>        y = sum (diff & weights) ./ sum (weights);
>        return
>      case "chebychev"
>        diff = X(:,Xi) - X(:,Yi);
>        y = max (abs (diff));
>        return
>    endswitch  ## unknown names fall through to the generic loop below
>  endif
>  endif
>
>  ## Distfun is a function handle or the name of an external function
>  l = rows (x);
>  y = zeros (1, nchoosek (l, 2));
>  idx = 1;
>  for ii = 1:l-1
>    for jj = ii+1:l
>      y(idx++) = feval (distfun, x(ii,:), x, varargin{:})(jj);
>    endfor
>  endfor
>
> endfunction
>
> %!test
> %! assert(pdist([0 1;0 2;7 6;5 6],"euclidean"),  [1.000 8.602 7.071 8.062 6.403 2.000],1e-3);
> %!test
> %! assert(pdist([0 1;0 2;7 6;5 6],"seuclidean"), [0.380 2.735 2.363 2.486 2.070 0.561],1e-3);
> %!test
> %! assert(pdist([0 1;0 2;7 6;5 6],"mahalanobis"),[1.384 1.967 2.446 2.384 1.535 2.045],1e-3);
> %!test
> %! assert(pdist([0 1;0 2;7 6;5 6],"cityblock"),  [1.000 12.00 10.00 11.00 9.000 2.000],1e-3);
> %!test
> %! assert(pdist([0 1;0 2;7 6;5 6],"minkowski"),  [1.000 8.602 7.071 8.062 6.403 2.000],1e-3);
> %!test
> %! assert(pdist([0 1;0 2;7 6;5 6],"minkowski",3),[1.000 7.763 6.299 7.410 5.738 2.000],1e-3);
> %!test
> %! assert(pdist([0 1;0 2;7 6;5 6],"cosine"),     [0.000 0.349 0.231 0.349 0.231 0.013],1e-3);
> %!test
> %! assert(pdist([0 1;0 2;7 6;5 6],"correlation"),[0.000 2.000 0.000 2.000 0.000 2.000],1e-3);
> %!test
> %! assert(pdist([0 1;0 2;7 6;5 6],"spearman"),   [0.000 2.000 0.000 2.000 0.000 2.000],1e-3);
> %!test
> %! assert(pdist([0 1;0 2;7 6;5 6],"hamming"),    [0.500 1.000 1.000 1.000 1.000 0.500],1e-3);
> %!test
> %! assert(pdist([0 1;0 2;7 6;5 6],"jaccard"),    [1.000 1.000 1.000 1.000 1.000 0.500],1e-3);
> %!test
> %! assert(pdist([0 1;0 2;7 6;5 6],"chebychev"),  [1.000 7.000 5.000 7.000 5.000 2.000],1e-3);
> ============================================================
>
> --
> Francesco Potortì (ricercatore)        Voice: +39 050 315 3058 (op.2111)
> ISTI - Area della ricerca CNR          Fax:   +39 050 315 2040
> via G. Moruzzi 1, I-56124 Pisa         Email: [EMAIL PROTECTED]
> (entrance 20, 1st floor, room C71)     Web:   http://fly.isti.cnr.it/
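[As a quick cross-check of the condensed ordering documented above, and of
the first test vector, here is a small sketch in plain Python;
`pdist_euclidean` is a hypothetical stand-in for illustration, not the
Octave function itself:]

```python
import math
from itertools import combinations

def pdist_euclidean(rows):
    # Reproduce pdist's condensed ordering
    # [(1,2) (1,3) ... (2,3) ... (n-1,n)] for the euclidean case:
    # combinations() yields the pairs in exactly that order.
    return [math.dist(a, b) for a, b in combinations(rows, 2)]

x = [(0, 1), (0, 2), (7, 6), (5, 6)]
y = pdist_euclidean(x)
# matches the first %!test vector: [1.000 8.602 7.071 8.062 6.403 2.000]
```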

Hi,

a nice job (I wanted this function several times).

I have two comments:
1. I think that instead of

if (...)
  switch ...
    case ...
      ...
      return
    case ...
      ...
      return
  endswitch
endif

...code...

you should use

if (...)
  switch ...
    case ...
    case ...
  endswitch
else
  ...code...
endif

I.e., no return in each case. This is more consistent with our
recommended coding style, where we try to minimize the number of exit
points from a function.

2. The changeset http://hg.savannah.gnu.org/hgweb/octave/rev/b11c31849b44
equipped `norm' with the ability to compute column or row norms of a matrix.
This can be used to replace your explicit norm expressions like
`sqrt (sumsq (diff))' or `(sum ((abs (diff)).^p)).^(1/p)'
with
`norm (diff, 'cols')' or `norm (diff, p, 'cols')', respectively.
Similarly for the 1- and Inf- norm.
Using the latter will probably be faster (avoids temporary matrices)
and will also be robust w.r.t. overflow (i.e. the 20-norm of numbers
of order 1e20 won't be Inf).
This is going to be a feature of 3.2.0, so if you're fine with pdist
depending on 3.2.0, I think you may exploit it.
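[To make the overflow point concrete, here is a small sketch in plain
Python, illustrative only and not Octave's actual `norm' code;
`norm2_naive' and `norm2_scaled' are hypothetical names:]

```python
import math

def norm2_naive(v):
    # Direct translation of sqrt (sumsq (v)): the intermediate squares
    # overflow to inf for entries around 1e200.
    return math.sqrt(sum(x * x for x in v))

def norm2_scaled(v):
    # Pre-scale by the largest magnitude, as a careful norm routine does,
    # so the intermediate sum stays in range.
    m = max(abs(x) for x in v)
    if m == 0.0:
        return 0.0
    return m * math.sqrt(sum((x / m) ** 2 for x in v))

v = [1e200, 1e200]
# norm2_naive(v) overflows to inf; norm2_scaled(v) gives about 1.414e200
```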

regards

-- 
RNDr. Jaroslav Hajek
computing expert
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz

_______________________________________________
Octave-dev mailing list
Octave-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/octave-dev
