Thanks for the advice, Grant.

On Wed, Jan 13, 2010 at 2:02 PM, Grant Ingersoll <[email protected]> wrote:
>
> On Jan 13, 2010, at 4:33 AM, Gökhan Çapan wrote:
>
> > Hi,
> > We have a local news aggregation(and news search engine) web site, which
> > show news stories within a cluster (a cluster of news articles from
> > different news sites that are about the same(sometimes just very similar)
> > story).
> > For clustering the news of the last crawl(not results of search, news
> > themselves), we use Carrot2, and it works pretty good.
> >
> > However, we sometimes need to publish summary of the week/month/year.
> >
> > I am not experienced about clustering, and from what I read about
> > clustering in this mailing list, I guess applying kmeans to data after
> > intelligently selecting initial clusters with canopy will fulfill our
> > needs.
>
> You can also try randomly selecting initial seeds or see some other threads
> about "kmeans++"
>
> > I have some questions about topic:
> > -Could anyone who is experienced about clustering stuff suggest me the
> > rightest way to detect news stories? Does the method I mentioned above
> > seem reasonable ?
>
> I think that way sounds reasonable, although publishing a summary may be
> tricky, depending on your needs. I would try out the various clustering
> algorithms and see which one gives you the best performance. Getting labels
> for the cluster is one thing, but a summary may be a whole other case.
>
> > -Do I need some initial work before clustering? Should I partition the
> > data into daily groups before clustering, for example?
>
> I would partition it into the length of time you want the results for
> (week/month/year)
>
> > (Again, in our case; a news story is an aggregated view of the
> > similar(nearly same) stories from different sources.)
> >
> > Finally, our search engine is built on Lucene/Solr. I've read our index
> > may be easily converted to Mahout vector format by lucene driver on Wiki
> > pages.
>
> If you have stored term vectors for the documents in question, then yes.
> Otherwise, no, you will not be able to.
>
> >
> > -Are the documents about clustering jobs in Wiki pages applicable with
> > "trunk"? If they are out of date, is there anywhere that I can reach the
> > documents about trunk?
>
> I believe they are pretty stable at this point, but I haven't reviewed
> every last one. Probably the best thing to do to see the inputs to the
> Driver is run the command with --help.
>
> -Grant

--
Gökhan Çapan
