[
https://issues.apache.org/jira/browse/MAHOUT-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vasil Vasilev updated MAHOUT-597:
---------------------------------
Attachment: meanshift_with_kernels.patch
This is a proposal for implementing various types of kernels with the means
shift algorithm. The following changes were introduced:
1. The original algorithm was changed: Previously when one cluster "touched"
other clusters its centroid was computed only based on the centroids of the
clusters it touched, but not based on his centroid itself. Now before
calculating the shift I one more line was added which makes the cluster observe
his personal centroid:
public boolean shiftToMean(MeanShiftCanopy canopy) {
canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
canopy.computeConvergence(measure, convergenceDelta);
canopy.computeParameters(); return canopy.isConverged(); }
With this modification also a problem with convergence of the algorithm is
resolved, when there are 2 clusters left which are within their T1 boundaries
and not within their T2 boundaries. In this cases their centroids will switch
from one another until infinity.
With this modification, however, the number of iterations until convergence
increased slightly.
This change was necessary in order to introduce the other types of "soft"
kernels.
2. IKernelProfile interface was introduced which has the methods:
public double calculateValue(double distance, double h);
public double calculateDerivativeValue(double distance, double h);
The first one is not used by the current implementations, because the value of
the kernel profile - k - is not needed in calculating the meanshift, rather the
value of the derivative of the kernel profile - g - is used.
3. Currenlty 2 implementations are created:
TriangularKernelProfile with calculated value:
@Override
public double calculateDerivativeValue(double distance, double h) { return
(distance < h) ? 1.0 : 0.0; }
This profile covers the cases that are currently covered by the meanshift
algorithm implementation
In addition NormalKernelProfile was created with calculated value:
@Override
public double calculateDerivativeValue(double distance, double h) { return
UncommonDistributions.dNorm(distance, 0.0, h); }
4. The merging of canopies now uses the kernel profile
public void mergeCanopy(MeanShiftCanopy aCanopy, Collection<MeanShiftCanopy>
canopies) {
MeanShiftCanopy closestCoveringCanopy = null;
double closestNorm = Double.MAX_VALUE;
for (MeanShiftCanopy canopy : canopies) {
double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
double weight = kernelProfile.calculateDerivativeValue(norm, t1);
if(weight > 0.0)
{ aCanopy.touch(canopy, weight); }
if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) {
closestNorm = norm; closestCoveringCanopy = canopy; }
}
if (closestCoveringCanopy == null) { canopies.add(aCanopy); } else {
closestCoveringCanopy.merge(aCanopy); }
}
5. The MeanShiftCanopy is modified so that its touch method includes also the
weight with which the clusters should observe each other. This weight was
calculated based on the kernel profile in use
void touch(MeanShiftCanopy canopy, double weight) { canopy.observe(getCenter(),
weight*((double)boundPoints.size())); observe(canopy.getCenter(),
weight*((double)canopy.boundPoints.size())); }
6. Some other modifications were made which were necessary to compile the code
and to make the tests pass
With the so changed algorithm the following observations can be made:
1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster content
is more intermixed
2. The following procedure for estimating T2 and convergence delta can be used:
* bind convergence delta = T2 / 2
* When you decrease T2 the number of iterations increases and the number of
resulting clusters after convergence decreases
* You come to a moment when you decrease T2 the number of iterations
increases, but the number of resulting clusters remains unchanged. This is the
point with the best value for T2
3. In case of Normal kernel what you give as T1 is in fact the standard
deviation of a normal distribution, so the radius of the window will be T1^2
> Kernels in Mean Shift
> ---------------------
>
> Key: MAHOUT-597
> URL: https://issues.apache.org/jira/browse/MAHOUT-597
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Reporter: Vasil Vasilev
> Priority: Minor
> Attachments: meanshift_with_kernels.patch
>
>
> Improve currently implemented variant of the mean shift algorithm in Mahout
> so that kernels are supported.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.