You can directly provide the S3 path.
What actually happens: when you submit an MR job on an AWS cluster with input
data in S3, Hadoop implicitly copies the data to HDFS and then runs the MR job.

These articles illustrate the actual use case and solution:
http://www.technology-mania.com/2012/05/s3-instead-of-hdfs-with-hadoop_05.html
http://www.technology-mania.com/2011/05/s3-as-input-or-output-for-hadoop-mr.html
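For illustration, a rough sketch of both options discussed in this thread (bucket
name and paths are hypothetical; the s3n:// scheme applies to classic Hadoop/EMR,
newer Hadoop versions use s3a://):

```shell
# Option 1: run the job directly against S3, using S3 URIs as input/output paths
hadoop jar hadoop-examples.jar wordcount \
  s3n://my-bucket/input/ \
  s3n://my-bucket/output/

# Option 2: if the data will be read by more than one job, copy it into HDFS
# first with distcp, then run the jobs against the HDFS copy
hadoop distcp s3n://my-bucket/input/ hdfs:///user/hadoop/input/
```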


On Wed, Jun 20, 2012 at 11:38 PM, Elliott Clark <
elliott.neil.cl...@gmail.com> wrote:

> It's faster to run directly on S3; we saw a 20% improvement, but that's very
> dependent on the data set. That assumes your job only runs over the data
> once. If you ever run two jobs over the data, then pulling it into HDFS is
> the perf winner.
> --
> Elliott Clark
>
>
> On Tue, Jun 19, 2012 at 9:56 PM, Yang <teddyyyy...@gmail.com> wrote:
>
> > On the other hand, I found that Hadoop commands work with the S3 file
> > system naturally, so we could let our MR jobs consume directly from S3,
> > and dump output directly to S3.
> >
> >
> > Are there any speed/performance implications? A rough guess is that
> > accessing S3 directly probably saves a little, but not much, since either
> > a separate copy or direct consumption has to pull the data through the
> > same pipe first. ???
> >
> >
>



-- 
*Regards*,
Rahul Patodi
