Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

Russell Jurney Mon, 05 Mar 2012 12:58:55 -0800

Streaming is good for simulation: long-running, map-only processes where Pig 
doesn't really help and it is simple to fire off a streaming process.  You do 
have to set some options so that tasks are allowed to take a long time to 
return, or have them return counters along the way.
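
A rough sketch of what such a mapper could look like, in case it helps. Hadoop
Streaming treats stderr lines of the form reporter:counter:<group>,<counter>,<amount>
and reporter:status:<message> as counter and status updates, and those updates
also tell the framework the task is still alive. Everything else here is an
assumption for illustration: the simulate() body is a placeholder, the input is
assumed to be one integer seed per line, and the counter names are made up.

```python
import sys


def report_counter(group, counter, amount, err):
    # Streaming parses this stderr line as a counter increment.
    err.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))
    err.flush()


def report_status(message, err):
    # Streaming parses this stderr line as a task status update.
    err.write("reporter:status:%s\n" % message)
    err.flush()


def simulate(seed, steps):
    # Hypothetical stand-in for a long-running simulation:
    # a simple linear congruential generator iterated `steps` times.
    state = seed
    for _ in range(steps):
        state = (state * 1103515245 + 12345) % (2 ** 31)
    return state


def run_mapper(lines, out=sys.stdout, err=sys.stderr, steps=1000):
    # Map-only: each input line is one seed, each output line is
    # "seed<TAB>result". No reducer is assumed.
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank input lines
        seed = int(line)
        report_status("simulating seed %d" % seed, err)
        result = simulate(seed, steps)
        out.write("%d\t%d\n" % (seed, result))
        # Hypothetical counter names; any group/counter pair works.
        report_counter("Simulation", "RunsCompleted", 1, err)
```

To use it as a streaming mapper, call run_mapper(sys.stdin) at the bottom of
the script. On the job side, the options I had in mind are along the lines of
-D mapred.reduce.tasks=0 for a map-only job and a larger mapred.task.timeout
(in milliseconds) if a single record can run longer than the default timeout
without reporting anything. Check the property names against your Hadoop
version.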


Russell Jurney http://datasyndrome.com

On Mar 5, 2012, at 12:38 PM, Eli Finkelshteyn <iefin...@gmail.com> wrote:

> I'm really interested in this as well. I have trouble seeing a really good 
> use case for streaming map-reduce. Is there something I can do in streaming 
> that I can't do in Pig? If I want to re-use previously made Python functions 
> from my code base, I can do that in Pig as much as Streaming, and from what 
> I've experienced thus far, Python streaming seems to go slower than or at the 
> same speed as Pig, so why would I want to write a whole lot of 
> more-difficult-to-read mappers and reducers when I can do equally fast 
> performance-wise, shorter, and clearer code in Pig? Maybe it's obvious, but 
> currently I just can't think of the right use case.
> 
> Eli
> 
> On 3/2/12 9:21 AM, Subir S wrote:
>> On Fri, Mar 2, 2012 at 12:38 PM, Harsh J<ha...@cloudera.com>  wrote:
>> 
>>> On Fri, Mar 2, 2012 at 10:18 AM, Subir S<subir.sasiku...@gmail.com>
>>> wrote:
>>>> Hello Folks,
>>>> 
>>>> Are there any pointers to such comparisons between Apache Pig and Hadoop
>>>> Streaming Map Reduce jobs?
>>> I do not see why you seek to compare these two. Pig offers a language
>>> that lets you write data-flow operations and runs these statements as
>>> a series of MR jobs for you automatically (Making it a great tool to
>>> use to get data processing done really quick, without bothering with
>>> code), while streaming is something you use to write non-Java, simple
>>> MR jobs. Both have their own purposes.
>>> 
>> Basically we are comparing these two to see the benefits and how much they
>> help in improving the productive coding time, without jeopardizing the
>> performance of MR jobs.
>> 
>> 
>>>> Also there was a claim in our company that Pig performs better than Map
>>>> Reduce jobs. Is this true? Are there any such benchmarks available?
>>> Pig _runs_ MR jobs. It does do job design (and some data)
>>> optimizations based on your queries, which is what may give it an edge
>>> over designing elaborate flows of plain MR jobs with tools like
>>> Oozie/JobControl (which takes more time to do). But regardless, Pig
>>> only makes it easier to do the same thing, using Pig Latin statements,
>>> for you.
>>> 
>> I knew that Pig runs MR jobs, as Hive does. But Hive jobs become
>> pretty slow with a lot of joins, which we can run faster by writing raw
>> MR jobs. So with that context, I was trying to see how Pig runs MR jobs:
>> for example, what kind of projects should consider Pig? Say, when we have
>> a lot of joins, which take time to write as plain MR jobs. Thoughts?
>> 
>> Thank you Harsh for your comments. They are helpful!
>> 
>> 
>>> --
>>> Harsh J
>>> 
>
