[ 
https://issues.apache.org/jira/browse/MAHOUT-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vasil Vasilev updated MAHOUT-597:
---------------------------------

    Attachment: meanshift_with_kernels.patch

This is a proposal for implementing various types of kernels with the means 
shift algorithm. The following changes were introduced:

1. The original algorithm was changed: Previously when one cluster "touched" 
other clusters its centroid was computed only based on the centroids of the 
clusters it touched, but not based on his centroid itself. Now before 
calculating the shift I one more line was added which makes the cluster observe 
his personal centroid:

public boolean shiftToMean(MeanShiftCanopy canopy) { 
canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size()); 
canopy.computeConvergence(measure, convergenceDelta); 
canopy.computeParameters(); return canopy.isConverged(); }

With this modification also a problem with convergence of the algorithm is 
resolved, when there are 2 clusters left which are within their T1 boundaries 
and not within their T2 boundaries. In this cases their centroids will switch 
from one another until infinity.
With this modification, however, the number of iterations until convergence 
increased slightly.
This change was necessary in order to introduce the other types of "soft" 
kernels.

2. IKernelProfile interface was introduced which has the methods:
public double calculateValue(double distance, double h);
public double calculateDerivativeValue(double distance, double h);

The first one is not used by the current implementations, because the value of 
the kernel profile - k - is not needed in calculating the meanshift, rather the 
value of the derivative of the kernel profile - g - is used.

3. Currenlty 2 implementations are created:
TriangularKernelProfile with calculated value:
@Override
public double calculateDerivativeValue(double distance, double h) { return 
(distance < h) ? 1.0 : 0.0; }
This profile covers the cases that are currently covered by the meanshift 
algorithm implementation

In addition NormalKernelProfile was created with calculated value:
@Override
public double calculateDerivativeValue(double distance, double h) { return 
UncommonDistributions.dNorm(distance, 0.0, h); }

4. The merging of canopies now uses the kernel profile

public void mergeCanopy(MeanShiftCanopy aCanopy, Collection<MeanShiftCanopy> 
canopies) {
MeanShiftCanopy closestCoveringCanopy = null;
double closestNorm = Double.MAX_VALUE;
for (MeanShiftCanopy canopy : canopies) {
double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
double weight = kernelProfile.calculateDerivativeValue(norm, t1);
if(weight > 0.0)
{ aCanopy.touch(canopy, weight); }

if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) { 
closestNorm = norm; closestCoveringCanopy = canopy; }
}
if (closestCoveringCanopy == null) { canopies.add(aCanopy); } else { 
closestCoveringCanopy.merge(aCanopy); }
}

5. The MeanShiftCanopy is modified so that its touch method includes also the 
weight with which the clusters should observe each other. This weight was 
calculated based on the kernel profile in use

void touch(MeanShiftCanopy canopy, double weight) { canopy.observe(getCenter(), 
weight*((double)boundPoints.size())); observe(canopy.getCenter(), 
weight*((double)canopy.boundPoints.size())); }

6. Some other modifications were made which were necessary to compile the code 
and to make the tests pass

With the so changed algorithm the following observations can be made:

1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster content 
is more intermixed
2. The following procedure for estimating T2 and convergence delta can be used:

    * bind convergence delta = T2 / 2
    * When you decrease T2 the number of iterations increases and the number of 
resulting clusters after convergence decreases
    * You come to a moment when you decrease T2 the number of iterations 
increases, but the number of resulting clusters remains unchanged. This is the 
point with the best value for T2
      3. In case of Normal kernel what you give as T1 is in fact the standard 
deviation of a normal distribution, so the radius of the window will be T1^2


> Kernels in Mean Shift
> ---------------------
>
>                 Key: MAHOUT-597
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-597
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Vasil Vasilev
>            Priority: Minor
>         Attachments: meanshift_with_kernels.patch
>
>
> Improve currently implemented variant of the mean shift algorithm in Mahout 
> so that kernels are supported.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to