Hi Matthew,
> student in electrical engineering at UT-Austin working in computer vision,
> which is closely tied to many of the problems Mahout is addressing
> (especially in my area of content-based retrieval).
Is it information retrieval from visual data that you're working on? We recently
had a presentation by someone who implemented motion detection on GPUs with
very impressive speedups (orders of magnitude compared to regular CPUs). I'm
wondering whether your expertise here could be used to implement map-reduce
distributed jobs that run multiple GPUs in parallel. I know this sounds a bit
crazy, but I've heard of bio-engineering companies doing just that -- running a
cluster of GPUs to speed up their computations. Just a wild thought. Back to
your proposal, though.
> mostly focused around approximate k-means algorithms (since that's a problem
> I've been working on lately). It sounds like you guys are already
> implementing canopy clustering for k-means -- is there any interest in
> developing another approximation algorithm based on randomized kd-trees for
> high-dimensional data? What about mean-shift clustering?
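For context on the canopy idea mentioned above: it uses a cheap distance measure and two thresholds (T1 > T2) to cut the data into overlapping "canopies" before an expensive clusterer such as k-means runs within them. A minimal, generic sketch -- not Mahout's actual implementation; the function name and thresholds are illustrative:

```python
import math
import random

def canopy_cluster(points, t1, t2):
    """Greedy canopy clustering sketch: pick a random center, put every
    point within the loose threshold t1 into its canopy, and permanently
    remove points within the tight threshold t2. Requires t1 > t2.
    (Illustrative only -- not Mahout's API.)"""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(random.randrange(len(remaining)))
        canopy = [center]
        kept = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                canopy.append(p)   # loosely close: joins this canopy
            if d >= t2:
                kept.append(p)     # not tightly bound: stays available
        remaining = kept
        canopies.append(canopy)
    return canopies

# Two well-separated groups collapse into two canopies:
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
canopies = canopy_cluster(points, t1=3.0, t2=2.0)
print(len(canopies))  # 2
```

Points between T2 and T1 may end up in several canopies; that overlap is what keeps the subsequent k-means pass from missing clusters that straddle canopy boundaries.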
In my experience, the largest challenge in data clustering is not figuring out
a new clustering methodology, but finding the right existing one to tackle a
particular problem. Isabel mentioned the web spam detection challenge -- this is a
good example of a multi-feature classification problem, and I know people have
tried clustering the host graph to come up with more coarse-grained features for
hosts. A challenge I personally find very interesting is doing something
like Google News does (event aggregation). This is less trivial than it might
seem at first -- most news items are very similar to each other (copy/paste and
editing changes), so it's trivial to find small clusters of near-clones. The
problem then becomes more difficult because all news items cover pretty much the
same people/events (take the presidential election in the U.S.). I think the
problems you could state here are:
1) approximating the optimal clustering granularity (call it the number of
clusters if you wish, although I think clustering should be driven by factors
other than just the number of clusters),
2) coming up with groupings of news items based on criteria _other_ than
keyword similarity -- for example, by region (geolocation), sentiment
(positive/negative news), the people involved, etc.,
3) multilingual news matching and clustering.
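On the near-clone point above, the "trivial" first step can be sketched with word shingles and Jaccard similarity -- a rough illustration with made-up helper names, not what Google News actually does:

```python
def shingles(text, k=3):
    """Set of k-word shingles for a document (hypothetical helper)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_near_clones(docs, threshold=0.5):
    """Greedy single-pass grouping: each document joins the first
    existing group whose representative it overlaps above the
    threshold; otherwise it starts a new group."""
    groups = []  # pairs of (representative shingle set, member indices)
    for i, doc in enumerate(docs):
        s = shingles(doc)
        for rep, members in groups:
            if jaccard(s, rep) >= threshold:
                members.append(i)
                break
        else:
            groups.append((s, [i]))
    return [members for _, members in groups]

docs = [
    "The president gave a speech on the economy today",
    "The president gave a speech on the economy today in Washington",
    "Local team wins the championship game in overtime",
]
print(group_near_clones(docs))  # [[0, 1], [2]] -- the copy/paste variants group
```

Once these tight near-clone groups are collapsed, the harder questions in points 1-3 -- granularity, non-keyword criteria, multilinguality -- begin.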
All the above issues lie at the intersection of several domains -- NLP,
clustering, classification. The tricky part is being able to put them together.
What would be of interest to you?
D.
> Again, I would be glad to help in any way I can.
>
> Matt
On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]>
wrote:
> On Saturday 01 March 2008, Grant Ingersoll wrote:
> > Also, any thoughts on what we might want someone to do? I think it
> > would be great to have someone implement one of the algorithms on our
> > wiki.
>
> Just as a general note, the deadline for applications:
>
> March 12: Mentoring organization application deadline (12 noon PDT/19:00
> UTC).
>
> I suppose we should identify interesting tasks before that deadline. As a
> general guideline for mentors and for project proposals:
> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
>
> Isabel
> --
> Better late than never. -- Titus Livius (Livy)
>
>  |\      _,,,---,,_          Web: <http://www.isabel-drost.de>
>  /,`.-'`'    -.  ;-;;,_
> |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_)  (fL)   IM: <xmpp://[EMAIL PROTECTED]>