Hey Dawid,

> Is it information retrieval from visual data you're working on? We have
> recently had a presentation about a guy who implemented motion detection
> on GPUs with very impressive speedups (orders of magnitude compared to
> normal CPUs). I'm wondering if your expertise here could be used to
> implement map-reduce distributed jobs for running multiple GPUs in
> parallel. I know this sounds a bit crazy, but I've heard of
> bio-engineering companies doing just that -- running a cluster of GPUs to
> speed up their computations. Just a wild thought. Back to your proposal
> though.

Yes, it is basically information retrieval that I'm performing on sets of
images. In fact, many of the best algorithms used today for object
detection, object retrieval, etc. are adaptations of basic text-retrieval
approaches (e.g. tf-idf-weighted vector space models). I've never worked
with GPUs for image processing myself, but I imagine their vector
processing abilities would be useful at almost every stage of the indexing
and retrieval processes. I would be interested in looking into those
possibilities in more detail.

> > mostly focused around approximate k-means algorithms (since that's a
> > problem I've been working on lately). It sounds like you guys are
> > already implementing canopy clustering for k-means. Is there any
> > interest in developing another approximation algorithm based on
> > randomized kd-trees for high dimensional data? What about mean-shift
> > clustering?
>
> From my experience the largest challenge in data clustering is not
> figuring out a new clustering methodology, but finding the right existing
> one to tackle a particular problem. Isabel mentioned the web spam
> detection challenge -- this is a good example of a multi-feature
> classification problem and I know people have tried clustering the host
> graph to come up with more coarse-grained features for hosts. From my own
> interest, a very interesting challenge is doing something like Google
> News does (event aggregation). This is less trivial than you might think
> at first -- most news items are very similar to each other (copy/paste
> and editing changes), so it's trivial to find small clusters of
> near-clones. Then the problem becomes more difficult because all the news
> items talk about pretty much the same people/events (take the
> presidential election in the U.S.).
> I think the problems you could state here are:
>
> 1) approximating optimal clustering granularity (call it the number of
> clusters if you wish, although I think clustering should be driven by
> other factors rather than just the number of clusters),
>
> 2) coming up with clusters of news items based on something _other_ than
> keyword-based similarity. One example here is grouping news by region
> (geolocation), sentiment (positive/negative news), people-related news,
> etc.
>
> 3) multilingual news matching and clustering.
>
> All the above issues are on the border of different domains -- NLP,
> clustering, classification. The tricky part is being able to put them
> together. What would be of interest to you?

These are all interesting problems, actually. I've done some research into
sentiment analysis, as you mentioned in (2), and I think it's still a wide
open problem. Oren Etzioni at UWash does some interesting related work:
www.cs.washington.edu/homes/etzioni/.

I would basically be interested in doing anything that fits in well with
the overall goals of the Mahout project. Whether that is implementing
well-known algorithms within the Hadoop framework or working on some novel
idea is up to the mentors, I presume. Personally, if I'm going to be
working on something novel, I would like to relate it to my current
research work... and I'm happy to discuss that with anyone on the list who
is interested.

Matt

>
> D.
>
> > Again, I would be glad to help in any way I can.
> >
> > Matt
> >
> > On Thu, Mar 6, 2008 at 12:56 AM, Isabel Drost <[EMAIL PROTECTED]>
> > wrote:
> >
> >> On Saturday 01 March 2008, Grant Ingersoll wrote:
> >>> Also, any thoughts on what we might want someone to do? I think it
> >>> would be great to have someone implement one of the algorithms on our
> >>> wiki.
> >> Just as a general note, the deadline for applications:
> >>
> >> March 12: Mentoring organization application deadline (12 noon
> >> PDT/19:00 UTC).
> >>
> >> I suppose we should identify interesting tasks before that deadline.
> >> As a general guideline for mentors and for project proposals:
> >>
> >> http://code.google.com/p/google-summer-of-code/wiki/AdviceforMentors
> >>
> >> Isabel
> >>
> >> --
> >> Better late than never. -- Titus Livius (Livy)
> >>   |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
> >>   /,`.-'`'    -.  ;-;;,_
> >>  |,4-  ) )-,_..;\ (  `'-'
> >> '---''(_/--'  `-'\_) (fL)  IM:  <xmpp://[EMAIL PROTECTED]>
> >>
> >
>