On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov <[email protected]> wrote:

>
> I would like to give some feedback. And ask some questions as well :).
>

Thank you!

Very helpful feedback.


> ... Carrot2 for 2 weeks ... has a great level of
> usability and simplicity, but ... I had to give up on it since my very
> first practical clustering task required clustering 23K+ documents.


Not too surprising.


> ...
> I have managed to do some clustering on my 23,000+ docs with
> Mahout/k-means in something like 10 min (in standalone mode - no parallel
> processing at all; I didn't even use all of my (3 :-) ) cores yet with
> Hadoop/Mahout), but I am still learning and still trying to analyze
> whether the resulting clusters are really meaningful for my docs.
>

I have seen this effect before, where a map-reduce program run sequentially
is much faster than an all-in-memory implementation.


> One thing I can already tell is that I definitely, desperately, need
> stop-word removal


You should be able to do this in the document -> vector conversion.  You
could also do this at the vector level by multiplying the coordinates of all
stop words by zero, but that is not as nice a solution.
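
For illustration, here is a minimal sketch of the tokenizer-level approach.
The class and method names are hypothetical (plain Java, not an existing
Mahout API), but the idea is to drop stop words before a term ever gets a
vector coordinate:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stop word filter to hook into whatever tokenization
// you run during the document -> vector conversion.
public class StopWordFilter {
  private final Set<String> stopWords;

  public StopWordFilter(Set<String> stopWords) {
    this.stopWords = stopWords;
  }

  // Drop stop words before terms reach the term dictionary, so they
  // never get a vector coordinate at all.
  public List<String> filter(List<String> tokens) {
    List<String> kept = new ArrayList<String>();
    for (String token : tokens) {
      if (!stopWords.contains(token.toLowerCase())) {
        kept.add(token);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Set<String> stops = new HashSet<String>(Arrays.asList("the", "a", "of"));
    StopWordFilter filter = new StopWordFilter(stops);
    // prints [quick, fox]
    System.out.println(filter.filter(Arrays.asList("the", "quick", "fox")));
  }
}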


> ... But it would be valuable for me to be able
> to come back later to the complete context of a document (i.e. with the
> stopwords inside) - maybe it is a question of its own - how can I easily
> go back from clusters -> original docs (and not just vectors)? I do not
> know, maybe some kind of mapper which maps vectors to the original
> documents somehow (e.g. a sort of URL for a document based on the vector
> id/index or something?).
>

To do this, you should use the document ID and just return the original
content from some other content store.  Lucene, or especially Solr, can
help with this.
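
As a concrete sketch, assuming the original text lives in a Lucene 3.x-era
index where each document was stored with "id" and "content" fields (the
field names and the index path are both assumptions about your setup):

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DocumentLookup {
  // Fetch the original text for a clustered vector, keyed by the
  // document ID that was carried along when the vectors were built.
  public static String originalContent(Directory index, String docId)
      throws Exception {
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    try {
      TopDocs hits = searcher.search(new TermQuery(new Term("id", docId)), 1);
      if (hits.totalHits == 0) {
        return null;  // no matching document in the content store
      }
      return searcher.doc(hits.scoreDocs[0].doc).get("content");
    } finally {
      searcher.close();
      reader.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Directory index = FSDirectory.open(new File("/path/to/index"));
    System.out.println(originalContent(index, "doc-12345"));
  }
}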


> ...
> I think I will get better results if I can also apply stemming. What
> would be your recommendation when using Mahout? Should I, again, do the
> stemming somewhere in the input vector forming?


Yes.  That is exactly correct.
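
For example, a vectorization-time analyzer could lower-case, drop stop
words, and Porter-stem in one chain.  This sketch is written against the
Lucene 3.0-era API; the constructor signatures have shifted between Lucene
releases, so treat it as illustrative:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Analyzer to use while forming the input vectors: tokenize,
// lower-case, remove stop words, then Porter-stem what is left.
public class StemmingAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
    stream = new LowerCaseFilter(stream);
    stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    return new PorterStemFilter(stream);
  }
}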

> It is also really essential for me to have "updateable" algorithms as I
> am adding new documents on a daily basis, and I would definitely like to
> have them clustered immediately (incrementally) - I do not know if this
> is what is called "classification" in Mahout and I did not reach those
> examples yet (I wanted to really get acquainted with the clustering
> first).
>

I can't comment on exactly how this should be done, but we definitely need
to support this use case.


> And that is not all - I do not only want to have new documents clustered
> against existing clusters; what I want in addition is that the clusters
> could actually change as new docs come in.
>

Exactly.  This is easy algorithmically with k-means.  It just needs to be
supported by the software.
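
To make the algorithmic part concrete, here is a minimal sketch of the
online update: assign each new document vector to its nearest centroid and
move that centroid by the running-mean rule.  The names are hypothetical;
Mahout's k-means job runs batch iterations, so this shows the idea rather
than an existing Mahout API:

// Online k-means: fold each new document into the existing clustering.
public class OnlineKMeans {
  private final double[][] centroids;  // k centroids over d dimensions
  private final int[] counts;          // documents assigned per centroid

  public OnlineKMeans(double[][] initialCentroids) {
    this.centroids = initialCentroids;
    this.counts = new int[initialCentroids.length];
  }

  // Returns the cluster the new vector lands in, updating that centroid
  // as a running mean of everything assigned to it so far.
  public int addDocument(double[] v) {
    int best = nearest(v);
    counts[best]++;
    double[] c = centroids[best];
    for (int i = 0; i < c.length; i++) {
      c[i] += (v[i] - c[i]) / counts[best];
    }
    return best;
  }

  // Index of the centroid with the smallest squared Euclidean distance.
  private int nearest(double[] v) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int k = 0; k < centroids.length; k++) {
      double dist = 0;
      for (int i = 0; i < v.length; i++) {
        double d = v[i] - centroids[k][i];
        dist += d * d;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = k;
      }
    }
    return best;
  }
}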


> Of course one could not observe new clusters popping up after a single
> new doc is added to the analysis, but clusters should really be
> adaptable/updateable with new docs.
>

Yes.  It is eminently doable.  Occasionally you should run back through all
of the document vectors so that you can look at old documents in light of
new data, but that should be very, very fast in your case.
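
That occasional full pass is just one standard k-means (Lloyd) iteration
over everything you have stored: reassign every vector, then recompute the
centroids from scratch.  Again a hypothetical sketch, not a Mahout API:

import java.util.List;

// One batch pass that reconsiders every stored vector in light of the
// current centroids and returns freshly recomputed centroids.
public class FullRepass {
  public static double[][] reclusterOnce(List<double[]> vectors,
                                         double[][] centroids) {
    int k = centroids.length;
    int d = centroids[0].length;
    double[][] sums = new double[k][d];
    int[] counts = new int[k];
    for (double[] v : vectors) {
      int best = nearest(v, centroids);  // reassign old docs to new clusters
      counts[best]++;
      for (int i = 0; i < d; i++) {
        sums[best][i] += v[i];
      }
    }
    for (int c = 0; c < k; c++) {
      if (counts[c] > 0) {
        for (int i = 0; i < d; i++) {
          sums[c][i] /= counts[c];  // mean of the newly assigned vectors
        }
      } else {
        sums[c] = centroids[c];  // keep an empty cluster's old centroid
      }
    }
    return sums;
  }

  private static int nearest(double[] v, double[][] centroids) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double dist = 0;
      for (int i = 0; i < v.length; i++) {
        double diff = v[i] - centroids[c][i];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }
}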
