Hi Matthew,
> student in electrical engineering at UT-Austin working in computer vision,
> which is closely tied to many of the problems Mahout is addressing
> (especially in my area of content-based retrieval).
Is it information retrieval from visual data that you're working on? We recently
had a presentation by someone who implemented motion detection on GPUs with
very impressive speedups (orders of magnitude compared to regular CPUs). I'm
wondering whether your expertise here could be used to implement map-reduce
distributed jobs that run multiple GPUs in parallel. I know this sounds a bit
crazy, but I've heard of bio-engineering companies doing just that -- running a
cluster of GPUs to speed up their computations. Just a wild thought. Back to
your proposal, though.
> mostly focused around approximate k-means algorithms (since that's a problem
> I've been working on lately). It sounds like you guys are already
> implementing canopy clustering for k-means -- is there any interest in
> developing another approximation algorithm based on randomized kd-trees for
> high-dimensional data? What about mean-shift clustering?
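For context on the canopy idea mentioned above: it uses a cheap distance measure and two thresholds (T1 > T2) to cut the data into overlapping "canopies" before an expensive clusterer such as k-means runs within them. A minimal, generic sketch -- not Mahout's actual implementation; the function name and thresholds are illustrative:

```python
import math
import random

def canopy_cluster(points, t1, t2):
    """Greedy canopy clustering sketch: pick a random center, put every
    point within the loose threshold t1 into its canopy, and permanently
    remove points within the tight threshold t2. Requires t1 > t2.
    (Illustrative only -- not Mahout's API.)"""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(random.randrange(len(remaining)))
        canopy = [center]
        kept = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                canopy.append(p)   # loosely close: joins this canopy
            if d >= t2:
                kept.append(p)     # not tightly bound: stays available
        remaining = kept
        canopies.append(canopy)
    return canopies

# Two well-separated groups collapse into two canopies:
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
canopies = canopy_cluster(points, t1=3.0, t2=2.0)
print(len(canopies))  # 2
```

Points between T2 and T1 may end up in several canopies; that overlap is what keeps the subsequent k-means pass from missing clusters that straddle canopy boundaries.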
In my experience, the largest challenge in data clustering is not figuring out
a new clustering methodology, but finding the right existing one to tackle a
particular problem. Isabel mentioned the web spam detection challenge -- this is a
good example of a multi-feature classification problem, and I know people have
tried clustering the host graph to come up with more coarse-grained features for
hosts. A challenge I personally find very interesting is doing something
like Google News does (event aggregation). This is less trivial than it might
seem at first -- most news items are very similar to each other (copy/paste and
editing changes), so it's trivial to find small clusters of near-clones. The
problem then becomes more difficult because all news items cover pretty much the
same people/events (take the presidential election in the U.S.). I think the
problems you could state here are:
1) approximating the optimal clustering granularity (call it the number of
clusters if you wish, although I think clustering should be driven by factors
other than just the number of clusters),
2) coming up with groupings of news items based on criteria _other_ than
keyword similarity -- for example, by region (geolocation), sentiment
(positive/negative news), the people involved, etc.,
3) multilingual news matching and clustering.
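On the near-clone point above, the "trivial" first step can be sketched with word shingles and Jaccard similarity -- a rough illustration with made-up helper names, not what Google News actually does:

```python
def shingles(text, k=3):
    """Set of k-word shingles for a document (hypothetical helper)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_near_clones(docs, threshold=0.5):
    """Greedy single-pass grouping: each document joins the first
    existing group whose representative it overlaps above the
    threshold; otherwise it starts a new group."""
    groups = []  # pairs of (representative shingle set, member indices)
    for i, doc in enumerate(docs):
        s = shingles(doc)
        for rep, members in groups:
            if jaccard(s, rep) >= threshold:
                members.append(i)
                break
        else:
            groups.append((s, [i]))
    return [members for _, members in groups]

docs = [
    "The president gave a speech on the economy today",
    "The president gave a speech on the economy today in Washington",
    "Local team wins the championship game in overtime",
]
print(group_near_clones(docs))  # [[0, 1], [2]] -- the copy/paste variants group
```

Once these tight near-clone groups are collapsed, the harder questions in points 1-3 -- granularity, non-keyword criteria, multilinguality -- begin.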
All the above issues lie at the intersection of several domains -- NLP,
clustering, classification. The tricky part is being able to put them together.
What would be of interest to you?
D.
> Again, I would be glad to help in any way I can.
>
> Matt
On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]>
wrote:
> On Saturday 01 March 2008, Grant Ingersoll wrote:
> > Also, any thoughts on what we might want someone to do? I think it
> > would be great to have someone implement one of the algorithms on our
> > wiki.
>
> Just as a general note, the deadline for applications:
>
> March 12: Mentoring organization application deadline (12 noon PDT/19:00
> UTC).
>
> I suppose we should identify interesting tasks before that deadline. As a
> general guideline for mentors and for project proposals:
> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
>
> Isabel
> --
> Better late than never. -- Titus Livius (Livy)
>
>  |\      _,,,---,,_          Web: <http://www.isabel-drost.de>
>  /,`.-'`'    -.  ;-;;,_
> |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_)  (fL)   IM: <xmpp://[EMAIL PROTECTED]>