Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

Eli Finkelshteyn Mon, 05 Mar 2012 12:38:52 -0800

I'm really interested in this as well. I have trouble seeing a really
good use case for streaming map-reduce. Is there something I can do in
streaming that I can't do in Pig? If I want to re-use previously made
Python functions from my code base, I can do that in Pig as much as in
Streaming, and from what I've experienced thus far, Python streaming
seems to run slower than or at the same speed as Pig. So why would I
want to write a whole lot of more-difficult-to-read mappers and reducers
when I can write shorter, clearer code in Pig that performs just as
well? Maybe it's obvious, but currently I just can't think of the right
use case.

Eli


On 3/2/12 9:21 AM, Subir S wrote:

On Fri, Mar 2, 2012 at 12:38 PM, Harsh J <ha...@cloudera.com> wrote:

On Fri, Mar 2, 2012 at 10:18 AM, Subir S <subir.sasiku...@gmail.com> wrote:

Hello Folks,

Are there any pointers to such comparisons between Apache Pig and Hadoop
Streaming Map Reduce jobs?

I do not see why you seek to compare these two. Pig offers a language
that lets you write data-flow operations and runs these statements as
a series of MR jobs for you automatically (making it a great tool for
getting data processing done really quickly, without bothering with
code), while streaming is something you use to write simple MR jobs
in a language other than Java. Each has its own purpose.
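
To make the streaming side of that concrete, here is a minimal sketch of
a word count written the way a Python streaming job is usually laid out.
The function names (map_lines, reduce_lines) and the "map"/"reduce"
command-line switch are made up for illustration; the only thing Hadoop
Streaming itself assumes is that the mapper and reducer read lines on
stdin and write tab-separated key/value lines on stdout, and that the
reducer sees its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each key; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical switch: run the same file as "map" or "reduce".
    steps = {"map": map_lines, "reduce": reduce_lines}
    if sys.argv[1] in steps:
        for record in steps[sys.argv[1]](sys.stdin):
            print(record)
```

In Pig the same job would be a handful of Pig Latin statements (load,
split into words, group, count), which is roughly the trade-off being
described: streaming gives you full control of each step, Pig writes
the steps for you.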

Basically we are comparing these two to see what benefits each offers
and how much each reduces coding time without jeopardizing the
performance of MR jobs.

Also, there was a claim in our company that Pig performs better than
hand-written Map Reduce jobs. Is this true? Are there any such
benchmarks available?

Pig _runs_ MR jobs. It does do job design (and some data)
optimizations based on your queries, which is what may give it an edge
over designing elaborate flows of plain MR jobs with tools like
Oozie/JobControl (which takes more time to do). But regardless, Pig
only makes it easier to do the same thing, using Pig Latin statements
instead.

I knew that Pig runs MR jobs, just as Hive runs MR jobs. But Hive jobs
become pretty slow with a lot of joins, which we can run faster by
writing raw MR jobs. With that context, I was trying to understand how
Pig runs MR jobs, and for example what kinds of projects should
consider Pig. Say, when we have a lot of joins, which take time to
write as plain MR jobs. Thoughts?
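
For a sense of what a hand-written join involves, below is a rough
sketch of a reduce-side inner join in Python. It is only an
illustration under assumptions: the names tag_records and join_records
and the "L"/"R" tags are hypothetical, the input is assumed to be
"key<TAB>value" lines, and sorted() stands in for the shuffle that
Hadoop performs between map and reduce.

```python
from itertools import groupby


def tag_records(lines, tag):
    """Map step: split 'key<TAB>value' lines and label them with their source."""
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        yield (key, tag, value)


def join_records(tagged):
    """Reduce step: inner join of 'L' and 'R' records that share a key."""
    # Sorting stands in for the shuffle Hadoop does between map and reduce.
    for key, group in groupby(sorted(tagged), key=lambda rec: rec[0]):
        left, right = [], []
        for _, tag, value in group:
            (left if tag == "L" else right).append(value)
        # Emit every left/right combination for this key.
        for left_value in left:
            for right_value in right:
                yield (key, left_value, right_value)
```

In Pig Latin the equivalent is a single statement of the form
C = JOIN A BY id, B BY id; and Pig plans the MR jobs for it. Each extra
join in a plain MR flow means more of the tagging and grouping code
above, plus chaining the jobs together.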

Thank you Harsh for your comments. They are helpful!

--
Harsh J
