It's faster to run directly on S3; we saw a 20% improvement, but that depends heavily on the data set. That also assumes your job makes only one pass over the data. If you ever run two jobs over the same data, then pulling it into HDFS first is the performance winner.
-- Elliott Clark
On Tue, Jun 19, 2012 at 9:56 PM, Yang <teddyyyy...@gmail.com> wrote:
> on the other hand, I found that hadoop commands work with S3 file system
> naturally,
> so we could let our MR jobs directly consume S3, and directly dump out to
> S3.
>
>
> are there any speed/performance implications? a rough guess is that it's
> probably going to save
> a little if we access S3 directly, but not much different, since either a
> separate copy or direct consumption
> both have to go through the same pipe first. ???
>
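
To make the one-pass versus two-pass tradeoff above concrete, here is a rough back-of-the-envelope sketch in Python. The function names and every number in it are made up for illustration (they are not measurements from this thread); the only thing taken from the discussion is the shape of the argument: a direct S3 job pays the S3 read cost on every pass, while the HDFS route pays a one-time copy and then runs each pass against local data.

```python
# Hypothetical cost model; all names and timings are illustrative assumptions.

def total_minutes_direct(passes, direct_job_minutes):
    """Every pass reads straight from S3, so the full job cost repeats."""
    return passes * direct_job_minutes


def total_minutes_via_hdfs(passes, copy_minutes, hdfs_job_minutes):
    """Pay for the S3 -> HDFS copy once, then run each pass against HDFS."""
    return copy_minutes + passes * hdfs_job_minutes


def cheaper_option(passes, direct_job_minutes, copy_minutes, hdfs_job_minutes):
    """Return 's3' or 'hdfs' depending on which total is smaller (ties go to 's3')."""
    direct = total_minutes_direct(passes, direct_job_minutes)
    staged = total_minutes_via_hdfs(passes, copy_minutes, hdfs_job_minutes)
    return "s3" if direct <= staged else "hdfs"


if __name__ == "__main__":
    # Assumed numbers: copy takes 50 min, a job on HDFS takes 50 min, and the
    # same job reading directly from S3 takes 80 min (20% less than 50 + 50).
    for passes in (1, 2, 3):
        print(passes, cheaper_option(passes, 80, 50, 50))
```

With these assumed numbers a single pass favors S3 (80 vs. 100 minutes) and two passes favor HDFS (160 vs. 150 minutes). Where the crossover actually falls depends on your copy time and data set, so plug in your own timings.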