As I was writing on octave-help:

>I am stydying clustering and started from the pdist function in the
>statistics package.
>
>I see that it has some problems.  For one, the seuclidean, mahalanobis,
>minkowski, cosine, correlation and spearman distances give various
>errors when tried on the vector [1 3; 2 3; 4 2; 4 5; 5 2].
>
>Second, while the euclidean, cityblock and chebychev distances appear to
>work, the hamming and jaccard distances apparently give wrong results.
>
>But the most significant problem is that the method it uses, as far as I
>can see, cannot work for seuclidean, mahalanobis, minkowski, cosine,
>correlation, that is, for those where the distance of two vectors
>depends on all the other vectors.

I finished rewriting pdist from scratch, by looking at the documentation
page published on the Mathworks site.  I had my implementation checked
against Matlab (I do not have Matlab myself) and it gives in fact the
same results for the test cases I tried.  I added tests to the bottom of
the pdist file.

So I think it is ready to be checked in into octave-forge's archive.

Unfortunately, it is a long time since I last updated my copy of
octave-forge's repository, and my CVS access (Sourceforge user fpoto)
does not work any more.  I see that you have switched to SVN.  Would
anyone be so kind as to explain me how to resync my old tree and upload
the new pdist.m file?

In the meantime, here is the file.


===File ~/math/octavelib/statistics/pdist.m=================
## Copyright (C) 2008  Francesco Potortì  <[EMAIL PROTECTED]>
##
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 3, or (at your option)
## any later version.
##
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with this software; see the file COPYING.  If not, see
## <http://www.gnu.org/licenses/>.

## -*- texinfo -*-
## @deftypefn {Function File} @var{y} = pdist (@var{x})
## @deftypefnx {Function File} @var{y} = pdist (@var{x}, @var{distfun})
## @deftypefnx {Function File} @var{y} = pdist (@var{x},
## @var{distfun}, @var{distfunarg}, @dots{})
##
## Return the distance between any two rows in @var{x}.
##
## @var{x} is the @[EMAIL PROTECTED] matrix representing @var{q} row vectors of
## size @var{d}.
##
## The output is a dissimilarity matrix arranged into a row vector
## @var{y},(n - 1) * (n / 2) long, where the distances are in the order
## [(1, 2) (1, 3) @dots{} (2, 3) @dots{} (n-1, n)].  You can use the
## @seealso{squareform} function to display the distances between the
## vectors arranged into an @[EMAIL PROTECTED] matrix.
##
## @var{distfun} is an optional argument specifying how the distance is
## computed. It can be any of the following ones, defaulting to
## "euclidean", or a user defined function that takes two arguments
## distfun @var{x} and @var{y} plus any number of optional arguments,
## where @var{x} is a row vector and and @var{y} is a matrix having the
## same number of columns as @var{x}.  @var{distfun} returns a column
## vector where row @var{i} is the distance between @var{x} and row
## @var{i} of @var{y}. Any additional arguments after the @var{distfun}
## are passed as distfun (@var{x}, @var{y}, @var{distfunarg1},
## @var{distfunarg2} @dots{}).
##
## Predefined distance functions are:
##
## @table @samp
## @item "euclidean"
## Euclidean distance (default).
##
## @item "seuclidean"
## Standardized Euclidean distance. Each coordinate in the sum of
## squares is inverse weighted by the sample variance of that
## coordinate.
##
## @item "mahalanobis"
## Mahalanobis distance: @seealso{mahalanobis}.
##
## @item "cityblock"
## City Block metric, aka Manhattan distance.
##
## @item "minkowski"
## Minkowski metric.  Accepts a numeric parameter @var{p}: for @var{p}=1
## this is the same as the cityblock metric, with @var{p}=2 (default) it
## is equal to the euclidean metric.
##
## @item "cosine"
## One minus the cosine of the included angle between rows, seen as
## vectors.
##
## @item "correlation"
## One minus the sample correlation between points (treated as
## sequences of values).
##
## @item "spearman"
## One minus the sample Spearman's rank correlation between
## observations, treated as sequences of values.
##
## @item "hamming"
## Hamming distance: the quote of the number of coordinates that differ.
##
## @item "jaccard"
## One minus the Jaccard coefficient, the quote of nonzero
## coordinates that differ.
##
## @item "chebychev"
## Chebychev distance: the maximum coordinate difference.
## @end table
## @seealso{linkage,squareform}
## @end deftypefn

## Author: Francesco Potortì  <[EMAIL PROTECTED]>

function y = pdist (x, distfun, varargin)

  if (nargin < 1)
    print_usage ();
  elseif ((nargin > 1)
          && ! ischar (distfun)
          && ! isa (distfun, "function_handle"))
    error (["pdist: the distance function must be either a string or a "
            "function handle."]);
  endif

  if (nargin < 2)
    distfun = "euclidean";
  endif

  if (! ismatrix (x) || isempty (x))
    error ("pdist: x must be a nonempty matrix");
  elseif (length (size (x)) > 2)
    error ("pdist: x must be 1 or 2 dimensional");
  endif

  if (ischar (distfun))
    order = nchoosek(1:rows(x),2);
    Xi = order(:,1);
    Yi = order(:,2);
    X = x';
    y = feval (["pdist_" distfun], x', Xi, Yi, varargin{:});
    switch (distfun)
      case "euclidean"
        diff = X(:,Xi) - X(:,Yi);
        d = sqrt (sumsq (diff));
        return
      case "seuclidean"
        diff = X(:,Xi) - X(:,Yi);
        weights = inv (diag (var (X')));
        d = sqrt (sum ((weights * diff) .* diff));
        return
      case "mahalanobis"
        diff = X(:,Xi) - X(:,Yi);
        weights = inv (cov (X'));
        d = sqrt (sum ((weights * diff) .* diff));
        return
      case "cityblock"
        diff = X(:,Xi) - X(:,Yi);
        d = sum (abs (diff));
        return
      case "minkowski"
        diff = X(:,Xi) - X(:,Yi);
        if (nargin > 2)
          p = varargin{1};
          d = (sum ((abs (diff)).^p)).^(1/p);
        else
          d = sqrt (sumsq (diff)); # default p=2
        endif
        return
      case "cosine"
        prod = X(:,Xi) .* X(:,Yi);
        weights = sumsq (X(:,Xi)) .* sumsq (X(:,Yi));
        d = 1 - sum (prod) ./ sqrt (weights);
        return
      case "correlation"
        corr = cor (X);
        d = 1 - corr (sub2ind (size (corr), Xi, Yi))';
        return
      case "spearman"
        corr = spearman (X);
        d = 1 - corr (sub2ind (size (corr), Xi, Yi))';
        return
      case "hamming"
        diff = logical (X(:,Xi) - X(:,Yi));
        d = sum (diff) / rows (X);
        return
      case "jaccard"
        diff = logical (X(:,Xi) - X(:,Yi));
        weights = X(:,Xi) | X(:,Yi);
        d = sum (diff & weights) ./ sum (weights);
        return
      case "chebychev"
        diff = X(:,Xi) - X(:,Yi);
        d = max (abs (diff));
        return
    endswitch
  endif

  ## Distfun is a function handle or the name of an external function
  l = rows (x);
  y = zeros (1, nchoosek (l, 2))
  idx = 1;
  for ii = 1:l-1
    for jj = ii+1:l
      y(idx++) = feval (distfun, x(ii,:), x, varargin{:})(jj);
    endfor
  endfor

endfunction

%!test
%! assert(pdist([0 1;0 2;7 6;5 6],"euclidean"),  [1.000 8.602 7.071 8.062 6.403 
2.000],1e-3);
%!test
%! assert(pdist([0 1;0 2;7 6;5 6],"seuclidean"), [0.380 2.735 2.363 2.486 2.070 
0.561],1e-3);
%!test
%! assert(pdist([0 1;0 2;7 6;5 6],"mahalanobis"),[1.384 1.967 2.446 2.384 1.535 
2.045],1e-3);
%!test
%! assert(pdist([0 1;0 2;7 6;5 6],"cityblock"),  [1.000 12.00 10.00 11.00 9.000 
2.000],1e-3);
%!test
%! assert(pdist([0 1;0 2;7 6;5 6],"minkowski"),  [1.000 8.602 7.071 8.062 6.403 
2.000],1e-3);
%!test
%! assert(pdist([0 1;0 2;7 6;5 6],"minkowski",3),[1.000 7.763 6.299 7.410 5.738 
2.000],1e-3);
%!test
%! assert(pdist([0 1;0 2;7 6;5 6],"cosine"),     [0.000 0.349 0.231 0.349 0.231 
0.013],1e-3);
%!test
%! assert(pdist([0 1;0 2;7 6;5 6],"correlation"),[0.000 2.000 0.000 2.000 0.000 
2.000],1e-3);
%!test
%! assert(pdist([0 1;0 2;7 6;5 6],"spearman"),   [0.000 2.000 0.000 2.000 0.000 
2.000],1e-3);
%!test
%! assert(pdist([0 1;0 2;7 6;5 6],"hamming"),    [0.500 1.000 1.000 1.000 1.000 
0.500],1e-3);
%!test
%! assert(pdist([0 1;0 2;7 6;5 6],"jaccard"),    [1.000 1.000 1.000 1.000 1.000 
0.500],1e-3);
%!test
%! assert(pdist([0 1;0 2;7 6;5 6],"chebychev"),  [1.000 7.000 5.000 7.000 5.000 
2.000],1e-3);
============================================================

-- 
Francesco Potortì (ricercatore)        Voice: +39 050 315 3058 (op.2111)
ISTI - Area della ricerca CNR          Fax:   +39 050 315 2040
via G. Moruzzi 1, I-56124 Pisa         Email: [EMAIL PROTECTED]
(entrance 20, 1st floor, room C71)     Web:   http://fly.isti.cnr.it/


-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Octave-dev mailing list
Octave-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/octave-dev

Reply via email to