Thanks Robert, Niels

Yes, I think text manipulation, especially n-grams, is a good application for
me.
Cheers
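
The unigram/bigram extraction Robert describes below could be sketched as a
Hadoop Streaming mapper. This is a hypothetical illustration, not code from
the thread; it assumes whitespace tokenization and the usual streaming
key<TAB>value output for a word-count reducer:

```python
import sys

def ngrams(tokens, n):
    """Yield every contiguous n-gram from a list of tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def map_line(line, max_n=2):
    """Return all unigrams up to max_n-grams for one input line."""
    tokens = line.strip().split()
    grams = []
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            grams.append(" ".join(gram))
    return grams

if __name__ == "__main__":
    # Streaming mapper loop: read lines from stdin, emit "gram\t1".
    for line in sys.stdin:
        for gram in map_line(line):
            print("%s\t1" % gram)
```

Note the output size grows roughly linearly in max_n, which is why this kind
of job is shuffle-heavy rather than CPU-heavy on the map side.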

On Fri, May 20, 2011 at 12:57 AM, Robert Evans <[email protected]> wrote:

> I'm not sure if this has been mentioned or not, but in Machine Learning with
> text-based documents, the first stage is often a glorified word count
> action.  Except much of the time they will do N-grams.  So
>
> Map Input:
> "Hello this is a test"
>
> Map Output:
> "Hello"
> "This"
> "is"
> "a"
> "test"
> "Hello" "this"
> "this" "is"
> "is" "a"
> "a" "test"
> ...
>
>
> You may also be extracting all kinds of other features from the text, but
> the tokenization/n-gram step is not that CPU intensive.
>
> --Bobby Evans
>
> On 5/19/11 3:06 AM, "elton sky" <[email protected]> wrote:
>
> Hello,
> I pick up this topic again, because what I am looking for is something not
> CPU bound. Augmenting data for ETL and generating an index are good examples.
> Neither of them requires much CPU time on the map side; the main bottleneck
> for them is shuffle and merge.
>
> Market basket analysis is CPU intensive in the map phase, because it
> enumerates all possible combinations of items.
>
> I am still looking for more applications which create bigger output and are
> not CPU bound.
> Any further ideas? I'd appreciate it.
>
>
> On Tue, May 3, 2011 at 3:10 AM, Steve Loughran <[email protected]> wrote:
>
> > On 30/04/2011 05:31, elton sky wrote:
> >
> >> Thank you for suggestions:
> >>
> >> Weblog analysis, market basket analysis and generating search index.
> >>
> >> I guess for these applications we need more reducers than maps, to handle
> >> the large intermediate output. Besides, the input split for each map
> >> should be smaller than usual, because the workload for spill and merge on
> >> the map's local disk is heavy.
> >>
> >
> > any form of rendering can generate very large images
> >
> > see: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
> >
> >
> >
>
>
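
As a postscript on the market basket point above: the map-side combination
enumeration could be sketched as below. This is a hypothetical illustration
(names like `itemsets` are my own), emitting every itemset up to a fixed size
from one transaction, which is exactly the part that makes the map phase CPU
intensive:

```python
from itertools import combinations

def itemsets(basket, max_size=2):
    """Yield all item combinations of size 2..max_size from one transaction.

    Items are deduplicated and sorted so each itemset has a canonical key,
    letting the reducer count identical itemsets across transactions.
    """
    items = sorted(set(basket))
    for k in range(2, max_size + 1):
        for combo in combinations(items, k):
            yield combo
```

The number of emitted itemsets grows combinatorially with basket size and
max_size, so unlike n-gram extraction this workload is map-CPU bound rather
than shuffle bound.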
