Streaming is good for simulation: long-running, map-only processes where Pig doesn't really help and it is simple to fire off a streaming process. You do have to set some options so that tasks are allowed to take a long time to return output or counters.
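To make that concrete, here is a rough sketch of a map-only streaming mapper in Python. It is only an illustration: the random-walk `simulate` function, the one-seed-per-input-line layout, and the counter names are all made up. The real parts are the `reporter:counter:` and `reporter:status:` lines written to stderr, which is how a streaming task updates counters and status. The options I have in mind are along the lines of `-D mapred.reduce.tasks=0` (map-only) and a larger `-D mapred.task.timeout` (milliseconds); property names differ between Hadoop versions, so check yours.

```python
import random
import sys


def report_counter(group, counter, amount, err=None):
    """Increment a Hadoop counter using the streaming stderr protocol."""
    err = err if err is not None else sys.stderr
    err.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))
    err.flush()


def report_status(message, err=None):
    """Update the task status line so the framework sees activity."""
    err = err if err is not None else sys.stderr
    err.write("reporter:status:%s\n" % message)
    err.flush()


def simulate(seed, steps, report_every, err=None):
    """Hypothetical stand-in for a long simulation: a seeded random walk."""
    rng = random.Random(seed)
    position = 0
    for step in range(1, steps + 1):
        position += rng.choice((-1, 1))
        if step % report_every == 0:
            report_status("seed %s: step %d of %d" % (seed, step, steps), err)
    return position


def run_mapper(lines, out=None, err=None, steps=100000, report_every=10000):
    """Read one seed per input line and emit 'seed<TAB>result' per run."""
    out = out if out is not None else sys.stdout
    for line in lines:
        seed = line.strip()
        if not seed:
            continue  # skip blank input lines
        result = simulate(seed, steps, report_every, err)
        out.write("%s\t%d\n" % (seed, result))
        report_counter("Simulation", "runs_completed", 1, err)


if __name__ == "__main__":
    run_mapper(sys.stdin)
```

You can try it locally without Hadoop by piping a file of seeds into the script; the reporter lines just show up on stderr.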
Russell Jurney http://datasyndrome.com

On Mar 5, 2012, at 12:38 PM, Eli Finkelshteyn <iefin...@gmail.com> wrote:

> I'm really interested in this as well. I have trouble seeing a really good
> use case for streaming map-reduce. Is there something I can do in streaming
> that I can't do in Pig? If I want to re-use previously made Python functions
> from my code base, I can do that in Pig as much as Streaming, and from what
> I've experienced thus far, Python streaming seems to go slower than or at the
> same speed as Pig, so why would I want to write a whole lot of
> more-difficult-to-read mappers and reducers when I can do equally fast
> performance-wise, shorter, and clearer code in Pig? Maybe it's obvious, but
> currently I just can't think of the right use case.
>
> Eli
>
> On 3/2/12 9:21 AM, Subir S wrote:
>> On Fri, Mar 2, 2012 at 12:38 PM, Harsh J<ha...@cloudera.com> wrote:
>>
>>> On Fri, Mar 2, 2012 at 10:18 AM, Subir S<subir.sasiku...@gmail.com>
>>> wrote:
>>>> Hello Folks,
>>>>
>>>> Are there any pointers to such comparisons between Apache Pig and Hadoop
>>>> Streaming Map Reduce jobs?
>>> I do not see why you seek to compare these two. Pig offers a language
>>> that lets you write data-flow operations and runs these statements as
>>> a series of MR jobs for you automatically (Making it a great tool to
>>> use to get data processing done really quick, without bothering with
>>> code), while streaming is something you use to write non-Java, simple
>>> MR jobs. Both have their own purposes.
>>>
>> Basically we are comparing these two to see the benefits and how much they
>> help in improving the productive coding time, without jeopardizing the
>> performance of MR jobs.
>>
>>
>>>> Also there was a claim in our company that Pig performs better than Map
>>>> Reduce jobs? Is this true? Are there any such benchmarks available
>>> Pig _runs_ MR jobs. It does do job design (and some data)
>>> optimizations based on your queries, which is what may give it an edge
>>> over designing elaborate flows of plain MR jobs with tools like
>>> Oozie/JobControl (Which takes more time to do). But regardless, Pig
>>> only makes it easy doing the same thing with Pig Latin statements for
>>> you.
>>>
>> I knew that Pig runs MR jobs, as Hive runs MR jobs. But Hive jobs become
>> pretty slow with lot of joins, which we can achieve faster with writing raw
>> MR jobs. So with that context was trying to see how Pig runs MR jobs. Like
>> for example what kind of projects should consider Pig. Say when we have a
>> lot of Joins, which writing with plain MR jobs takes time. Thoughts?
>>
>> Thank you Harsh for your comments. They are helpful!
>>
>>
>>> --
>>> Harsh J
>>>
>