Re: Some newbie questions- Mahout clustering

Gökhan Çapan Wed, 13 Jan 2010 04:34:20 -0800

Thanks for the advice, Grant.

On Wed, Jan 13, 2010 at 2:02 PM, Grant Ingersoll <[email protected]> wrote:


>
> On Jan 13, 2010, at 4:33 AM, Gökhan Çapan wrote:
>
> > Hi,
> > We have a local news aggregation (and news search engine) web site, which
> > shows news stories within a cluster (a cluster of news articles from
> > different news sites that are about the same, or sometimes just a very
> > similar, story).
> > For clustering the news of the last crawl (the news themselves, not search
> > results), we use Carrot2, and it works pretty well.
> >
> > However, we sometimes need to publish a summary of the week/month/year.
> >
> > I am not experienced with clustering, and from what I have read about
> > clustering on this mailing list, I guess applying kmeans to the data after
> > intelligently selecting initial clusters with canopy will fulfill our needs.
>
> You can also try randomly selecting the initial seeds, or see some other
> threads about "kmeans++".
>
> > I have some questions about the topic:
> > -Could anyone who is experienced with clustering suggest the best way to
> > detect news stories? Does the method I mentioned above seem reasonable?
>
> I think that way sounds reasonable, although publishing a summary may be
> tricky, depending on your needs.  I would try out the various clustering
> algorithms and see which one gives you the best performance.  Getting labels
> for the cluster is one thing, but a summary may be a whole other case.
>
> > -Do I need some initial work before clustering? Should I partition the
> > data into daily groups before clustering, for example?
>
> I would partition it into the length of time you want the results for
> (week/month/year).
>
> > (Again, in our case, a news story is an aggregated view of the
> > similar (nearly identical) stories from different sources.)
> >
> > Finally, our search engine is built on Lucene/Solr. I've read on the Wiki
> > pages that our index may be easily converted to Mahout vector format by
> > the Lucene driver.
>
> If you have stored term vectors for the documents in question, then yes.
> Otherwise, no, you will not be able to.
>
> >
> > -Are the documents about clustering jobs on the Wiki pages applicable to
> > "trunk"? If they are out of date, is there anywhere I can find the
> > documents about trunk?
>
> I believe they are pretty stable at this point, but I haven't reviewed
> every last one.  Probably the best way to see the inputs to the
> Driver is to run the command with --help.
>
> -Grant
>
>
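
For reference, here is a minimal Python sketch of the "kmeans++" style seeding
mentioned above, as I understand it: the first seed is picked uniformly at
random, and each further seed is picked with probability proportional to its
squared distance from the nearest seed chosen so far. This is not Mahout code;
the names squared_distance and kmeans_pp_seeds are made up for illustration,
and the points are assumed to be plain tuples of numbers rather than Mahout
vectors.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centers with k-means++ style seeding (illustrative only)."""
    rng = rng or random.Random()
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # First seed: uniform random choice.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest seed so far.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a seed already; fall back to uniform.
            candidate = rng.choice(points)
        else:
            # Sample a point with probability proportional to d2.
            r = rng.uniform(0, total)
            cumulative = 0.0
            candidate = None
            for p, d in zip(points, d2):
                if d == 0:
                    continue  # points sitting on a seed are never re-picked
                cumulative += d
                candidate = p
                if cumulative >= r:
                    break
        centers.append(candidate)
        d2 = [min(d, squared_distance(p, candidate)) for p, d in zip(points, d2)]
    return centers
```

The seeds returned here would then be handed to an ordinary kmeans loop as the
initial centers, in place of canopy output or purely random seeds.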

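The partitioning suggested above (grouping by the length of time the summary
covers) could look roughly like the sketch below. It assumes each article is a
hypothetical (article_id, published_date) pair; the function names and the
"week"/"month"/"year" labels are my own and not part of Mahout, Lucene or
Carrot2. Weeks follow the ISO calendar, so the first days of January can fall
into the last week of the previous year.

```python
from collections import defaultdict
from datetime import date


def window_key(day, granularity):
    """Return a tuple identifying the week, month or year that `day` falls in."""
    if granularity == "week":
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        return (iso[0], iso[1])
    if granularity == "month":
        return (day.year, day.month)
    if granularity == "year":
        return (day.year,)
    raise ValueError("granularity must be 'week', 'month' or 'year'")


def partition_by_window(articles, granularity):
    """Group (article_id, published_date) pairs by time window.

    Each resulting group is meant to be clustered on its own.
    """
    groups = defaultdict(list)
    for article_id, published in articles:
        groups[window_key(published, granularity)].append(article_id)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        ("a", date(2010, 1, 11)),
        ("b", date(2010, 1, 13)),
        ("c", date(2010, 2, 1)),
    ]
    for key, ids in sorted(partition_by_window(sample, "month").items()):
        print(key, ids)
```

Each group of ids would then be vectorized and clustered separately, one run
per week, month or year.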

-- 
Gökhan Çapan
