Hi Matthew,

student in electrical engineering at UT-Austin working in computer vision,
which is closely tied to many of the problems Mahout is addressing
(especially in my area of content-based retrieval).

Is it information retrieval from visual data you're working on? We recently had a presentation by someone who implemented motion detection on GPUs with very impressive speedups (orders of magnitude compared to regular CPUs). I'm wondering whether your expertise here could be used to implement map-reduce distributed jobs that run multiple GPUs in parallel. I know this sounds a bit crazy, but I've heard of bio-engineering companies doing just that -- running a cluster of GPUs to speed up their computations. Just a wild thought. Back to your proposal, though.

mostly focused around approximate k-means algorithms (since that's a problem
I've been working on lately). It sounds like you guys are already
implementing canopy clustering for k-means -- is there any interest in
developing another approximation algorithm based on randomized kd-trees for
high-dimensional data? What about mean-shift clustering?
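Since canopy clustering came up, here is a rough standalone sketch of the basic idea -- cheap, overlapping pre-clusters that later seed k-means. This is my own toy illustration (the function, the T1/T2 values, and the sample points are assumptions, not Mahout's implementation):

```python
# Toy sketch of canopy clustering (illustrative only, not Mahout code).
# T1 > T2: points within the loose threshold T1 join a canopy; points
# within the tight threshold T2 are removed from further consideration.
import math

def canopy_cluster(points, t1, t2):
    """Group 2-D points into possibly overlapping canopies."""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # pick an arbitrary seed point
        canopy = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)       # cheap approximate distance
            if d < t1:
                canopy.append(p)           # loose threshold: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # outside tight threshold: stays a candidate
        remaining = still_remaining
        canopies.append(canopy)
    return canopies

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(canopy_cluster(pts, t1=1.0, t2=0.5))  # two canopies of two points each
```

Each canopy can then be handed to exact k-means, restricting the expensive distance computations to points sharing a canopy.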

From my experience, the largest challenge in data clustering is not inventing a new clustering methodology, but finding the right existing one for a particular problem. Isabel mentioned the web spam detection challenge -- this is a good example of a multi-feature classification problem, and I know people have tried clustering the host graph to come up with more coarse-grained features for hosts.

From my own interest, a very interesting challenge is doing something like Google News does (event aggregation). This is less trivial than you might think at first -- most news items are very similar to each other (copy/paste and editing changes), so it's easy to find small clusters of near-clones. The problem then becomes more difficult because nearly all news items talk about much the same people/events (take the presidential election in the U.S.). I think the problems you could state here are:

1) approximating the optimal clustering granularity (call it the number of clusters if you wish, although I think clustering should be driven by factors other than just the number of clusters),

2) coming up with clusterings of news items based on criteria _other_ than keyword similarity. Examples include grouping news by region (geolocation), by sentiment (positive/negative news), by the people involved, etc.

3) multilingual news matching and clustering.

All the above issues lie on the border between different domains -- NLP, clustering, classification. The tricky part is being able to put them together. What would be of interest to you?
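To make the near-clone point above concrete, here is a toy sketch of grouping near-duplicate news items by Jaccard similarity over word 3-shingles. Everything here (function names, the 0.5 threshold, the sample headlines) is my own illustration, and a real system would use minhash/LSH instead of pairwise comparison:

```python
# Toy near-clone grouping via word-shingle Jaccard similarity (illustrative).

def shingles(text, k=3):
    """Return the set of word k-shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_near_clones(docs, threshold=0.5):
    """Greedy single pass: each doc joins the first group whose
    representative is similar enough, else starts a new group."""
    groups = []  # list of (representative shingle set, member indices)
    for i, doc in enumerate(docs):
        s = shingles(doc)
        for rep, members in groups:
            if jaccard(s, rep) >= threshold:
                members.append(i)
                break
        else:
            groups.append((s, [i]))
    return [members for _, members in groups]

news = [
    "the president signed the new energy bill on monday",
    "the president signed the new energy bill on monday evening",  # near-clone
    "local team wins the championship after dramatic final",
]
print(group_near_clones(news))  # [[0, 1], [2]]
```

This only solves the easy first stage (collapsing clones); the harder granularity and cross-lingual problems from points 1) and 3) start after this step.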

D.


Again, I would be glad to help in any way I can.

Matt

On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]>
wrote:

On Saturday 01 March 2008, Grant Ingersoll wrote:
Also, any thoughts on what we might want someone to do?  I think it
would be great to have someone implement one of the algorithms on our
wiki.
Just as a general note, the deadline for applications:

March 12: Mentoring organization application deadline (12 noon PDT/19:00
UTC).

I suppose we should identify interesting tasks before that deadline. As a
general guideline for mentors and for project proposals:

http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors

Isabel

--
Better late than never.         -- Titus Livius (Livy)
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
 /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xmpp://[EMAIL PROTECTED]>

