I'm really interested in this as well. I have trouble seeing a really
good use case for streaming map-reduce. Is there something I can do in
Streaming that I can't do in Pig? If I want to reuse existing Python
functions from my code base, I can do that in Pig as easily as in
Streaming. From what I've seen so far, Python streaming runs at the
same speed as Pig or slower. So why would I write a lot of
harder-to-read mappers and reducers when Pig gives me shorter, clearer
code that performs just as well? Maybe it's obvious, but currently I
just can't think of the right use case.
Eli
On 3/2/12 9:21 AM, Subir S wrote:
On Fri, Mar 2, 2012 at 12:38 PM, Harsh J<ha...@cloudera.com> wrote:
On Fri, Mar 2, 2012 at 10:18 AM, Subir S<subir.sasiku...@gmail.com>
wrote:
Hello Folks,
Are there any pointers to such comparisons between Apache Pig and Hadoop
Streaming Map Reduce jobs?
I do not see why you seek to compare these two. Pig offers a language
that lets you write data-flow operations and runs these statements as
a series of MR jobs for you automatically (making it a great tool for
getting data processing done really quickly, without bothering with
code), while streaming is something you use to write simple, non-Java
MR jobs. Both have their own purposes.
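To make the streaming side concrete, here is a minimal sketch of a word-count mapper and reducer in Python, the kind of logic a Hadoop Streaming job runs by piping records through stdin and stdout. The function names and the tab-separated word/count layout are assumptions for illustration, not something taken from this thread.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each word.

    Assumes the input is already sorted by key, as it would be
    after the shuffle/sort phase between map and reduce.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on two sample lines.
sample = ["pig hive pig", "hive pig"]
counts = list(reduce_lines(sorted(map_lines(sample))))
```

In an actual streaming job, each function would probably live in its own script that reads sys.stdin and prints its records, and the framework would do the sorting between the two stages.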
Basically we are comparing these two to see the benefits and how much they
help in improving the productive coding time, without jeopardizing the
performance of MR jobs.
Also, there was a claim in our company that Pig performs better than Map
Reduce jobs. Is this true? Are there any such benchmarks available?
Pig _runs_ MR jobs. It does do job design (and some data)
optimizations based on your queries, which is what may give it an edge
over designing elaborate flows of plain MR jobs with tools like
Oozie/JobControl (which takes more time to do). But regardless, Pig
only makes it easy to do the same thing for you, through Pig Latin
statements.
I knew that Pig runs MR jobs, as Hive does. But Hive jobs become
pretty slow with a lot of joins, which we can run faster by writing raw
MR jobs. So with that context I was trying to see how Pig runs MR jobs,
for example what kind of projects should consider Pig, say when we have
a lot of joins, which take time to write as plain MR jobs. Thoughts?
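For a sense of why joins take time to write as plain MR jobs, here is a rough sketch of a reduce-side inner join in Python, in the style of a streaming job. The "L"/"R" source tags, the sample records, and the function names are all hypothetical. In Pig Latin the same operation is expressed with a single JOIN statement.

```python
from itertools import groupby


def tag_records(lines, tag):
    """Map step: parse 'key<TAB>value' and tag each record with its source."""
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        yield (key, tag, value)


def join_reducer(sorted_records):
    """Reduce step: per key, pair every left value with every right value.

    Assumes records arrive sorted by key, as after the shuffle phase.
    Keys present on only one side produce no output (inner join).
    """
    for key, group in groupby(sorted_records, key=lambda r: r[0]):
        left, right = [], []
        for _, tag, value in group:
            (left if tag == "L" else right).append(value)
        for lv in left:
            for rv in right:
                yield (key, lv, rv)


# Hypothetical sample inputs: users on the left, orders on the right.
users = ["1\talice", "2\tbob"]
orders = ["1\tbook", "1\tpen", "3\tlamp"]

# Local stand-in for the shuffle: merge both tagged inputs and sort by key.
shuffled = sorted(list(tag_records(users, "L")) + list(tag_records(orders, "R")))
joined = list(join_reducer(shuffled))
```

This sketch buffers all values for a key in memory, which a production job would likely need to avoid for heavily skewed keys.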
Thank you Harsh for your comments. They are helpful!
--
Harsh J