I have been using HDFS, setting the block size and the replication factor
to levels appropriate for the job.  When submitting the job, keep in mind
that each block of the file in HDFS will be passed to your mapper script as
standard input, and Hadoop will try to run each map task on a node that
holds its block locally.  This gives you a lot of options in regard to your
replication and block size settings.
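
For example, here's a rough sketch of how you might set those values when
loading a file (the dfs.block.size / dfs.replication property names and the
-D generic options are what I believe current releases accept, so double
check against your version; the paths and numbers are just placeholders):

hadoop fs -D dfs.block.size=134217728 -D dfs.replication=3 \
    -put mydatafile /user/theo/mydatafile

# or bump the replication on a file that's already in HDFS
hadoop fs -setrep -w 3 /user/theo/mydatafile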

Overall, it's very possible to optimize MapReduce for your specific job;
you just have to know how it does things.  Root around inside the file
system and watch it as it loads up the actual jobs.
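
A couple of commands I use for poking around (assuming the stock fsck and
job tools in your release):

# see which datanodes hold each block of your input file
hadoop fsck /user/theo/mydatafile -files -blocks -locations

# watch the job queue as tasks get scheduled
hadoop job -list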

Check out the streaming documentation for more ideas on how to optimize
your streaming jobs.
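
For reference, hstream is just a wrapper around something like the
following (the streaming jar path and option names below are from memory,
so double check them against your install):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myinputfile -output myoutput \
    -mapper /bin/wc \
    -numReduceTasks 0 \
    -jobconf mapred.job.name="map-only-wc"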

On Wed, Mar 12, 2008 at 9:21 AM, Jason Rennie <[EMAIL PROTECTED]> wrote:

> Hmm... sounds promising :)  How do you distribute the data?  Do you use
> HDFS?  Pass the data directly to the individual nodes?  We really only
> need
> to do the map operation like you.  We need to distribute a matrix * vector
> operation, so we want rows of the matrix distributed across different
> nodes.  Map could perform all the dot-products, which is the heavy lifting
> in what we're trying to do.  Might want to do a reduce after that, not
> sure...
>
> Jason
>
> On Tue, Mar 11, 2008 at 6:36 PM, Theodore Van Rooy <[EMAIL PROTECTED]>
> wrote:
>
> > There is overhead in grabbing local data, moving it in and out of the
> > system
> > and especially if you are running a map reduce job (like wc) which ends
> up
> > mapping, sorting, copying, reducing, and writing again.
> >
> > One way I've found to get around the overhead is to use Hadoop streaming
> > and
> > perform map only tasks.  While they recommend doing it properly with
> >
> > hstream -mapper /bin/cat -reducer /bin/wc
> >
> > I tried:
> >
> > hstream -input "myinputfile" -output "myoutput" -mapper /bin/wc
> > -numReduceTasks 0
> >
> > (hstream is just an alias to do Hadoop streaming)
> >
> > And saw an immediate speedup on a 1 Gig and 10 Gig file.
> >
> > In the end you may have several output files with the wordcount for each
> > file, but adding those files together is pretty quick and easy.
> >
> > My recommendation is to explore how you can get away with either
> > Identity Reduces, Maps or no reduces at all.
> >
> > Theo
> >
>
> --
> Jason Rennie
> Head of Machine Learning Technologies, StyleFeeder
> http://www.stylefeeder.com/
> Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/
>



-- 
Theodore Van Rooy
http://greentheo.scroggles.com
