Thanks Robert, Niels. Yes, I think text manipulation, especially n-grams, is a good application for me. Cheers
On Fri, May 20, 2011 at 12:57 AM, Robert Evans <[email protected]> wrote:
> I'm not sure if this has been mentioned or not, but in Machine Learning with
> text-based documents, the first stage is often a glorified word count
> action. Except much of the time they will do N-Grams. So
>
> Map Input:
> "Hello this is a test"
>
> Map Output:
> "Hello"
> "this"
> "is"
> "a"
> "test"
> "Hello" "this"
> "this" "is"
> "is" "a"
> "a" "test"
> ...
>
> You may also be extracting all kinds of other features from the text, but
> the tokenization/n-gram is not that CPU intensive.
>
> --Bobby Evans
>
> On 5/19/11 3:06 AM, "elton sky" <[email protected]> wrote:
>
> Hello,
> I pick up this topic again, because what I am looking for is something not
> CPU bound. Augmenting data for ETL and generating an index are good examples.
> Neither of them requires much CPU time on the map side. The main bottleneck
> for them is shuffle and merge.
>
> Market basket analysis is CPU intensive in the map phase, because it samples all
> possible combinations of items.
>
> I am still looking for more applications which create bigger output and are
> not CPU bound.
> Any further ideas? I appreciate it.
>
> On Tue, May 3, 2011 at 3:10 AM, Steve Loughran <[email protected]> wrote:
>
> > On 30/04/2011 05:31, elton sky wrote:
> >
> >> Thank you for the suggestions:
> >>
> >> Weblog analysis, market basket analysis, and generating a search index.
> >>
> >> I guess for these applications we need more reduces than maps, for
> >> handling the large intermediate output, isn't it? Besides, the input split
> >> for map should be smaller than usual, because the workload for spill and
> >> merge on the map's local disk is heavy.
> >
> > Any form of rendering can generate very large images.
> >
> > See: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
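For reference, the unigram + bigram map output Robert describes can be sketched in a few lines of Python. This is a minimal illustration of the tokenization step (the function name and structure are my own, not from the thread); in a real Hadoop job this logic would sit inside the mapper and each n-gram would be emitted as a key with a count of 1:

```python
def ngram_map(line, n_max=2):
    """Emit all n-grams up to n_max tokens for one input line,
    mimicking the map output shown above (unigrams first, then bigrams)."""
    tokens = line.split()
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(tuple(tokens[i:i + n]))
    return out

# The map input from the thread:
for gram in ngram_map("Hello this is a test"):
    print(gram)
```

Note how cheap this is per record: a split and a couple of nested loops, which is why the map phase here is I/O- and shuffle-bound rather than CPU-bound, while the intermediate output grows roughly linearly with `n_max`.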
