On Fri, Jan 1, 2010 at 3:24 PM, Grant Ingersoll <[email protected]> wrote:
> On Jan 1, 2010, at 5:00 AM, Ted Dunning wrote:
>
> > On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov
> > <[email protected]> wrote:
> >
> > > I would like to give some feedback. And ask some questions as well :).
> >
> > Thank you! Very helpful feedback.
> >
> > > ... Carrot2 for 2 weeks ... has a great level of usability and
> > > simplicity, but ... I had to give up on it since my very first
> > > practical clustering task required clustering 23K+ documents.
> >
> > Not too surprising.
>
> Right, Carrot2 is designed for clustering search results, and of that
> mainly the title and snippet. While it can do larger docs, they are
> specifically not the target. Plus, C2 is an in-memory tool designed to
> be very fast for search results.
>
> > > ...
> > > I have managed to do some clustering on my 23,000+ docs with
> > > Mahout/k-means in something like 10 min (in standalone mode - no
> > > parallel processing at all, I didn't even use all of my (3 :-) )
> > > cores yet with Hadoop/Mahout), but I am still learning and still
> > > trying to analyze whether the resulting clusters are really
> > > meaningful for my docs.
> >
> > I have seen this effect before, where a map-reduce program run
> > sequentially is much faster than an all-in-memory implementation.
> >
> > > One thing I can tell already now is that I definitely, desperately,
> > > need stop-word removal.
> >
> > You should be able to do this in the document -> vector conversion.
> > You could also do this at the vector level by multiplying the
> > coordinates of all stop words by zero, but that is not as nice a
> > solution.
>
> Right, or if you are using the Lucene extraction method, at Lucene
> indexing time.

Ok, so it seems I have to use the stop-word feature of Lucene itself,
right? I just saw there is something about stop words in Lucene, but I
have yet to find out how to use that capability.

> > > ... But it would be valuable for me to be able to come back later to
> > > the complete context of a document (i.e. with the stopwords inside) -
> > > maybe it is a question on its own - how can I easily go back from
> > > clusters -> original docs (and not just vectors)? I do not know,
> > > maybe some kind of mapper which maps vectors to the original
> > > documents somehow (e.g. a sort of URL for a document based on the
> > > vector id/index or something?).
> >
> > To do this, you should use the document ID and just return the
> > original content from some other content store. Lucene or especially
> > Solr can help with this.
>
> Right, Mahout's vectors can take labels.

> > > ...
> > > I think I will get better results if I can also apply stemming. What
> > > would be your recommendation when using Mahout? Should I do the
> > > stemming again somewhere in the input vector forming?
> >
> > Yes. That is exactly correct.
>
> Again, really easy to do if you use the Lucene method for creating
> vectors.

Do you mean I have to apply stemming during the vector creation or
already at Lucene indexing time? Maybe from a clustering POV it is the
same, but what would you recommend?
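For illustration, a minimal sketch of how stop-word removal and stemming
could both be wired in at Lucene analysis time, before the terms reach
Mahout's vector creation. This assumes the Lucene 2.9/3.0-era analysis API
(StandardTokenizer, StopFilter, PorterStemFilter); exact constructors vary
between Lucene versions, and the class name is invented for the example:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Illustrative analyzer: tokenize, lowercase, drop stop words, Porter-stem.
public class StemmedStopAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
    stream = new LowerCaseFilter(stream);
    // Lucene's built-in English stop word set; a domain-specific set
    // (or extra stop words) could be passed here instead.
    stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    stream = new PorterStemFilter(stream);
    return stream;
  }
}

Whether such an analyzer runs at indexing time or in a separate
document-to-vector step, the resulting terms are the same, so the choice is
mostly about where the work happens.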
> > > It is also really essential for me to have "updateable" algorithms,
> > > as I am adding new documents on a daily basis, and I definitely would
> > > like to have them clustered immediately (incrementally) - I do not
> > > know if this is what is called "classification" in Mahout, and I did
> > > not reach those examples yet (I wanted to really get acquainted with
> > > the clustering first).
> >
> > I can't comment on exactly how this should be done, but we definitely
> > need to support this use case.
>
> Don't people usually see if the new docs fit into an existing cluster
> and, if they are a good fit, add them there; otherwise, maybe put them in
> the best match and kick off a new job.

Actually this question goes back to the original attempt - to analyze
documents automatically, by the machine and not by people :). One of my
goals is to not read the new document but rather have the system tell me
whether I should read it ;) - e.g. if it gets clustered/classified against
a given cluster/topic which I am interested (or not interested) in, I could
then take a more informed decision whether to read it (or not).

> > > And that is not all - I do not only want to have new documents
> > > clustered against existing clusters; what I want in addition is that
> > > clusters could actually change with new docs coming.
> >
> > Exactly. This is easy algorithmically with k-means. It just needs to
> > be supported by the software.
>
> Makes sense and shouldn't be that hard to do. I'd imagine we just need to
> be able to use the centroids from the previous run as the seeds for the
> new run.

> > > Of course one could not observe new clusters popping up after a
> > > single new doc is added to the analysis, but clusters should really
> > > be adaptable/updateable with new docs.
> >
> > Yes. It is eminently doable. Occasionally you should run back through
> > all of the document vectors so you can look at old documents in light
> > of new data, but that should be very, very fast in your case.

I do not know how this updatable clustering works (using previous results
as centroids for new clusterings) - is there an example I could see in
action? Additionally, I would like to see an example of how one could
combine Canopy and k-means; I have only seen this described in theory and
could not find an example of it.

Best regards,
Bogdan
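To make the "previous centroids as seeds" idea concrete, here is a small
self-contained sketch in plain Java. It is only an illustration of the
seeding step, not Mahout's KMeansDriver API: the centroids produced by one
run are handed back in as the starting clusters for the next run over the
old plus new documents.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration of "updateable" k-means: centroids from the previous
// run seed the next run over old + new documents. Plain in-memory Java,
// not Mahout's driver API.
public class IncrementalKMeansSketch {

  // One k-means run: repeat assign/update for a fixed number of iterations.
  static double[][] kmeans(List<double[]> points, double[][] seeds, int iterations) {
    double[][] centroids = seeds.clone();
    for (int iter = 0; iter < iterations; iter++) {
      double[][] sums = new double[centroids.length][centroids[0].length];
      int[] counts = new int[centroids.length];
      for (double[] p : points) {                  // assignment step
        int best = nearest(centroids, p);
        counts[best]++;
        for (int d = 0; d < p.length; d++) {
          sums[best][d] += p[d];
        }
      }
      for (int c = 0; c < centroids.length; c++) { // update step
        if (counts[c] > 0) {
          for (int d = 0; d < sums[c].length; d++) {
            sums[c][d] /= counts[c];
          }
          centroids[c] = sums[c];
        }
      }
    }
    return centroids;
  }

  // Index of the centroid closest (squared Euclidean distance) to p.
  static int nearest(double[][] centroids, double[] p) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double dist = 0.0;
      for (int d = 0; d < p.length; d++) {
        double diff = centroids[c][d] - p[d];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<double[]> docs = new ArrayList<double[]>(Arrays.asList(
        new double[] {1.0, 0.0}, new double[] {1.1, 0.1}, new double[] {0.0, 1.0}));

    // Initial run, seeded with arbitrary guesses (e.g. the first k docs).
    double[][] centroids = kmeans(docs, new double[][] {{1.0, 0.0}, {0.0, 1.0}}, 10);

    // Later: new docs arrive; re-run with the OLD centroids as the seeds.
    docs.add(new double[] {0.1, 1.2});
    centroids = kmeans(docs, centroids, 10);
    System.out.println(Arrays.deepToString(centroids));
  }
}

The Canopy + k-means combination asked about above works the same way at
this step: a cheap canopy pass over the vectors produces the canopy
centers, and those centers (instead of random picks or previous centroids)
are used to seed k-means.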
