Re: MeanShift Clustering Patch

Chris Wailes Thu, 12 Aug 2010 16:01:22 -0700

> Also, speaking as a good hypocrite should (see MAHOUT-228), it would be
> good to have multiple small patches rather than a large omnibus patch since
> we can dispose of the small ones in less aggregate time than the large one.

In the future I will try to keep patches small and concise :-)
Unfortunately all the work has been done and I didn't have the foresight to
create smaller patches along the way.


> Some of the things that you suggest aren't what I tend to do, but in style
> issues, I value the experience of the naive reader over my own prejudices.
> If you were confused when you read the code, I think we should improve the
> readability.  If it is something that you do "just because", then we should
> probably both defer to the standard Mahout style rather than changing it.

As for the formatting work I have done, I tend to think I have reasons for
it all, but in the end they just boil down to personal value judgments.  I
find that having the functions arranged as I have done makes it easier to
find the function I'm looking for when I'm using simpler text editors.  Also,
I tend to put blank lines between logically separated blocks of code to
make that distinction clearer at first glance.  That way, when reading the
code (especially for the first time) you don't have to read every line to
see where one step begins and another ends.

> One thing that I particularly want to have at the end of that exercise is to
> have clusters and classification models be unified.  It should not matter
> (much) where a model came from, you should be able to classify new examples
> using it.  You should also be able to save and restore the model in a
> pretty uniform way.
>
> This also implies that we need a consistent way to represent examples to be
> classified.

With regard to the unified models for clustering and classification, I agree
that that would be nice.  However, moving in that direction all at once
seems like it would be hard.  With such big differences between the
individual clustering algorithms' models, it would be hard to create a brand
new class/interface in one fell swoop that would have all of the features
that they all needed.  What I had thought of doing was to pull each of the
clusterers into this new model one at a time.  As much functionality as
possible would be moved into the abstract BaseCluster (or whatever), and a
common interface could be constructed.  After that, a similar process could
be done for the classifiers and then the overlap between the two models
could be moved into a new, unifying class.
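
A rough sketch of what the first step of that migration might look like is
below.  It is only an illustration: the names ClusterModel and SimpleCanopy
are made up for this example, BaseCluster is just the placeholder name used
above, and none of this is existing Mahout code.  The idea is that shared
bookkeeping (point count, running center) moves into the abstract base, and
each clusterer keeps only its algorithm-specific state.

```java
import java.util.Arrays;

/** Illustrative only: none of these types exist in Mahout under these names. */
final class ClusterModelSketch {

    /** Hypothetical common interface that every cluster type would share. */
    interface ClusterModel {
        void observe(double[] point);
        int numPoints();
        double[] center();
    }

    /** Shared bookkeeping pulled up into the abstract base class. */
    abstract static class BaseCluster implements ClusterModel {
        private final double[] sum; // running per-dimension sum of observed points
        private int count;          // number of points observed so far

        protected BaseCluster(int dimensions) {
            this.sum = new double[dimensions];
        }

        @Override
        public void observe(double[] point) {
            if (point.length != sum.length) {
                throw new IllegalArgumentException("dimension mismatch");
            }
            for (int i = 0; i < sum.length; i++) {
                sum[i] += point[i];
            }
            count++;
        }

        @Override
        public int numPoints() {
            return count;
        }

        @Override
        public double[] center() {
            double[] c = new double[sum.length];
            if (count == 0) {
                return c; // no points yet: report the origin
            }
            for (int i = 0; i < sum.length; i++) {
                c[i] = sum[i] / count;
            }
            return c;
        }
    }

    /** First cluster type migrated; only its own state (t1) lives here. */
    static final class SimpleCanopy extends BaseCluster {
        private final double t1;

        SimpleCanopy(int dimensions, double t1) {
            super(dimensions);
            this.t1 = t1;
        }

        /** True if the point lies within t1 (Euclidean) of the current center. */
        boolean covers(double[] point) {
            double[] c = center();
            double squared = 0.0;
            for (int i = 0; i < c.length; i++) {
                double d = c[i] - point[i];
                squared += d * d;
            }
            return Math.sqrt(squared) < t1;
        }
    }

    public static void main(String[] args) {
        ClusterModel model = new SimpleCanopy(2, 3.0);
        model.observe(new double[] {0.0, 0.0});
        model.observe(new double[] {2.0, 4.0});
        System.out.println(model.numPoints() + " points, center "
            + Arrays.toString(model.center()));
    }
}
```

Each further clusterer would be pulled in the same way, and whatever still
differs between them after that would show what the common interface is
missing.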

> What thoughts do you have on larger scale design issues?  What would you
> like to see?  Can you share some user stories about how you would like to
> use clustering?

What I would really like to see, as a user, for the clustering API is this
clustering abstraction.  I would really like to be able to change my
clusterer simply by changing the class that is created, without having to
refactor all of my code from using MeanShiftCanopys to DirichletClusters.
I have only been using Mahout for a week or so on my local machine and
haven't reached a point in the project where I need to scale it up using
Hadoop yet, so I can't really comment on those issues.  This will be
something I'll be doing in the near future though.
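
To make that user story concrete, here is a minimal sketch of calling code
that only sees an interface.  Everything in it is hypothetical: the
PointClusterer interface does not exist in Mahout, and the two
implementations are deliberately trivial stand-ins (one lumps every point
together, the other is a greedy "leader" scheme with a distance threshold),
not mean shift or Dirichlet clustering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical interface; Mahout does not define this type. */
interface PointClusterer {
    List<List<double[]>> cluster(List<double[]> points);
}

/** Toy stand-in: puts every point into a single cluster. */
final class SingleGroupClusterer implements PointClusterer {
    @Override
    public List<List<double[]>> cluster(List<double[]> points) {
        List<List<double[]>> result = new ArrayList<>();
        if (!points.isEmpty()) {
            result.add(new ArrayList<>(points));
        }
        return result;
    }
}

/** Toy stand-in: a point joins the first cluster whose leader is close enough. */
final class LeaderClusterer implements PointClusterer {
    private final double threshold;

    LeaderClusterer(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public List<List<double[]>> cluster(List<double[]> points) {
        List<List<double[]>> result = new ArrayList<>();
        for (double[] p : points) {
            List<double[]> home = null;
            for (List<double[]> candidate : result) {
                // The leader is the first point that opened the cluster.
                if (distance(candidate.get(0), p) < threshold) {
                    home = candidate;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                result.add(home);
            }
            home.add(p);
        }
        return result;
    }

    private static double distance(double[] a, double[] b) {
        double squared = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            squared += d * d;
        }
        return Math.sqrt(squared);
    }
}

final class SwapDemo {
    /** Application code depends only on the interface. */
    static int countClusters(PointClusterer clusterer, List<double[]> points) {
        return clusterer.cluster(points).size();
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {0.0}, new double[] {1.0}, new double[] {10.0});
        // Switching algorithms means changing only the constructor call.
        System.out.println(countClusters(new SingleGroupClusterer(), points));
        System.out.println(countClusters(new LeaderClusterer(2.0), points));
    }
}
```

The only line that changes when switching algorithms is the one that
constructs the clusterer; countClusters() and anything built on top of it
stay as they are.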

Another change that I made and forgot to mention is the setT1() and setT2()
methods.  In my application I'll be mostly using the same clusterer over and
over again, but occasionally I'd like to let the user change these values to
get more or fewer clusters.  This gets back to making the clustering
algorithms behave more like objects instead of library functions.
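
As an illustration of that kind of re-use, the sketch below keeps T1 and T2
as instance state with setters.  It is not the patched MeanShiftClusterer:
the class name is made up, and the body is a plain in-memory version of the
textbook canopy pass (points within T1 of a center join its canopy, points
within T2 are no longer eligible to start a new one), assuming Euclidean
distance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical class for illustration; not the Mahout implementation. */
final class ReusableClustererSketch {
    private double t1; // loose threshold: points closer than this join the canopy
    private double t2; // tight threshold: points closer than this stop being candidates

    ReusableClustererSketch(double t1, double t2) {
        setT1(t1);
        setT2(t2);
    }

    void setT1(double t1) {
        if (t1 <= 0) throw new IllegalArgumentException("t1 must be positive");
        this.t1 = t1;
    }

    void setT2(double t2) {
        if (t2 <= 0) throw new IllegalArgumentException("t2 must be positive");
        this.t2 = t2;
    }

    /** Instance method: the same configured object can be applied to many inputs. */
    List<List<double[]>> cluster(List<double[]> points) {
        List<double[]> remaining = new ArrayList<>(points);
        List<List<double[]>> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            List<double[]> canopy = new ArrayList<>();
            canopy.add(center);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = distance(center, p);
                if (d < t1) canopy.add(p);  // loosely close: member of this canopy
                if (d < t2) it.remove();    // tightly close: cannot seed another canopy
            }
            canopies.add(canopy);
        }
        return canopies;
    }

    private static double distance(double[] a, double[] b) {
        double squared = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            squared += d * d;
        }
        return Math.sqrt(squared);
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {0.0}, new double[] {1.0},
            new double[] {2.0}, new double[] {10.0});
        ReusableClustererSketch clusterer = new ReusableClustererSketch(5.0, 3.0);
        System.out.println(clusterer.cluster(points).size());
        clusterer.setT2(0.5); // tighter T2 leaves more points free to seed canopies
        System.out.println(clusterer.cluster(points).size());
    }
}
```

With the thresholds held in the object, letting the user ask for more or
fewer clusters is a setter call followed by another cluster() call on the
same instance.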

- Chris Wailes

On Thu, Aug 12, 2010 at 4:34 PM, Ted Dunning <[email protected]> wrote:

> This is a great thing in general, and we were just discussing how the
> clustering and classification APIs need to be made more coherent.
>
> One thing that I particularly want to have at the end of that exercise is
> to have clusters and classification models be unified.  It should not matter
> (much) where a model came from, you should be able to classify new examples
> using it.  You should also be able to save and restore the model in a
> pretty uniform way.
>
> This also implies that we need a consistent way to represent examples to be
> classified.
>
> What you are talking about so far is to make the construction of clustering
> models more consistent which is really, really good and important, but it
> needs to be in the large context of making clustering and classification
> coherent as well.
>
> What thoughts do you have on larger scale design issues?  What would you
> like to see?  Can you share some user stories about how you would like to
> use clustering?
>
> On Thu, Aug 12, 2010 at 3:08 PM, Chris Wailes <[email protected]> wrote:
>
> > Lastly, I made an API change so that the MeanShiftClusterer behaved in a
> > more OO fashion.  Now, instead of having a static method
> > MeanShiftClusterer.clusterPoints() that then creates a MeanShiftClusterer
> > object, there is an instance method called cluster().  It uses the same
> > code, but makes re-use a lot easier if you want to cluster several groups
> > of points using the same parameters.
> >
>
