On Jan 13, 2010, at 4:33 AM, Gökhan Çapan wrote:

> Hi,
> We have a local news aggregation(and news search engine) web site, which
> show news stories within a cluster (a cluster of news articles from
> different news sites that are about the same(sometimes just very similar)
> story).
> For clustering the news of the last crawl(not results of search, news
> themselves), we use Carrot2, and it works pretty good.
>
> However, we sometimes need to publish summary of the week/month/year.
>
> I am not experienced about clustering, and from what I read about clustering
> in this mailing list, I guess applying kmeans to data after intelligently
> selecting initial clusters with canopy will fulfill our needs.

You can also try randomly selecting the initial seeds, or see some other threads about "kmeans++".

> I have some questions about topic:
> -Could anyone who is experienced about clustering stuff suggest me the
> rightest way to detect news stories? Does the method I mentioned above seem
> reasonable ?

That approach sounds reasonable, although publishing a summary may be tricky, depending on your needs. I would try out the various clustering algorithms and see which one gives you the best performance. Getting labels for the clusters is one thing, but a summary may be a whole other case.

> -Do I need some initial work before clustering? Should I partition the data
> into daily groups before clustering, for example?

I would partition it by the length of time you want the results for (week/month/year).

> (Again, in our case; a news story is an aggregated view of the
> similar(nearly same) stories from different sources.)
>
> Finally, our search engine is built on Lucene/Solr. I've read our index may
> be easily converted to Mahout vector format by lucene driver on Wiki pages.

If you have stored term vectors for the documents in question, then yes. Otherwise, no, you will not be able to.

>
> -Are the documents about clustering jobs in Wiki pages applicable with
> "trunk"? If they are out of date, is there anywhere that I can reach the
> documents about trunk?

I believe they are pretty stable at this point, but I haven't reviewed every last one. Probably the best way to see the inputs to the Driver is to run the command with --help.

-Grant
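
P.S. On the seeding remark: for anyone who has not followed the "kmeans++" threads, here is a rough sketch of the seeding idea in plain Python. This is not Mahout code, and the function names (`squared_distance`, `kmeans_pp_seeds`) are made up for illustration. The first seed is picked uniformly at random, and each later seed is picked with probability proportional to its squared distance from the nearest seed chosen so far.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centers from points using k-means++ style seeding.

    Hypothetical helper for illustration only; points is a list of tuples.
    """
    rng = rng or random.Random()
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # First seed: uniform random choice.
    seeds = [rng.choice(points)]
    # nearest[i] = squared distance from points[i] to its closest seed so far.
    nearest = [squared_distance(p, seeds[0]) for p in points]

    while len(seeds) < k:
        if sum(nearest) == 0:
            # Every point coincides with a seed; fall back to a uniform pick.
            new_seed = rng.choice(points)
        else:
            # Sample proportionally to squared distance from the nearest seed.
            new_seed = rng.choices(points, weights=nearest, k=1)[0]
        seeds.append(new_seed)
        nearest = [
            min(old, squared_distance(p, new_seed))
            for old, p in zip(nearest, points)
        ]
    return seeds


if __name__ == "__main__":
    docs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
    print(kmeans_pp_seeds(docs, 3, random.Random(0)))
```

Plain random seeding is just the first step repeated k times. The weighted draw tends to spread the seeds out, which is the whole point of the variant.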

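
P.P.S. On partitioning by time period: a minimal sketch of bucketing documents by week, month, or year before running a clustering job on each bucket. It assumes each article is available as a (publication date, document id) pair. That layout and the names `period_key` and `partition_by_period` are my own assumptions, not anything from Mahout or Solr.

```python
from collections import defaultdict
from datetime import date


def period_key(published, granularity):
    """Map a date to a bucket key for the requested granularity."""
    if granularity == "year":
        return (published.year,)
    if granularity == "month":
        return (published.year, published.month)
    if granularity == "week":
        iso = published.isocalendar()
        # ISO year and ISO week number; weeks start on Monday.
        return (iso[0], iso[1])
    raise ValueError("granularity must be 'week', 'month' or 'year'")


def partition_by_period(articles, granularity):
    """Group document ids by period; articles is an iterable of (date, id)."""
    buckets = defaultdict(list)
    for published, doc_id in articles:
        buckets[period_key(published, granularity)].append(doc_id)
    return dict(buckets)


if __name__ == "__main__":
    articles = [
        (date(2010, 1, 4), "a"),
        (date(2010, 1, 10), "b"),
        (date(2010, 1, 11), "c"),
        (date(2010, 2, 1), "d"),
    ]
    for key, ids in sorted(partition_by_period(articles, "week").items()):
        print(key, ids)
```

Each bucket can then be converted to vectors and clustered on its own, so a weekly summary only ever sees that week's stories.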