Thanks for the clarification, Ankur. Do you have any performance comparison between Pig 0.6 and plain Hadoop M/R? I would be interested in looking at it. The last comparison I heard about was in http://osdir.com/ml/hive-user-hadoop-apache/2009-06/msg00078.html. Pig 0.7.0 seems interesting; thanks for sharing the information. I am looking forward to experimenting with it.

Thanks
Pallavi

-----Original Message-----
From: Ankur C. Goel [mailto:gan...@yahoo-inc.com]
Sent: Wednesday, February 24, 2010 1:24 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Algorithm implementations in Pig

Pallavi,
Thanks for your comments. Some clarifications w.r.t. Pig.

Pig does not generate any M/R code. What it generates is logical, physical, and map-reduce plans, which are nothing but DAGs. The map-reduce plan is then interpreted by Pig's own mappers and reducers. The plan generation itself is done on the client side and takes a few seconds, or minutes if you have a really big script.

About performance tuning in Hadoop: all the M/R parameters can be adjusted in Pig to have the same effect they'd have in Java M/R programs. Pig 0.7 is moving towards using Hadoop's input/output formats in its load/store functions, so your custom I/O formats can be reused with little additional effort.

Pig also provides very nice features like MultiQuery optimization and skewed & merge joins that are hard to implement in Java M/R every time you need them.
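For example, switching a regular join to a skewed or merge join is just a USING clause. A minimal sketch (relation and field names are illustrative; a merge join assumes both inputs are already sorted on the join key):

    users  = LOAD 'users'  AS (uid:chararray, name:chararray);
    clicks = LOAD 'clicks' AS (uid:chararray, url:chararray);

    -- skewed join: spreads heavily repeated keys across reducers
    J1 = JOIN clicks BY uid, users BY uid USING 'skewed';

    -- merge join: map-side join of two inputs pre-sorted on uid,
    -- avoiding a shuffle of the sorted data
    J2 = JOIN clicks BY uid, users BY uid USING 'merge';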
With the latest Pig release, 0.6, the performance gap between Java M/R and Pig has narrowed to a good extent. Simple statistical measures that you would use to understand or preprocess your data take only a few lines of Pig code, and a lot of utility UDFs are available for that.
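For instance, per-item count, mean, and maximum over a ratings file take a handful of lines with the builtin COUNT, AVG, and MAX. A minimal sketch (file and field names are illustrative):

    -- per-item summary statistics over (item, rating) pairs
    data    = LOAD 'ratings' AS (item:chararray, rating:double);
    grouped = GROUP data BY item;
    stats   = FOREACH grouped GENERATE
                  group            AS item,
                  COUNT(data)      AS n,
                  AVG(data.rating) AS mean,
                  MAX(data.rating) AS max_rating;
    STORE stats INTO 'rating_stats';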
Besides all the good things, I agree that there are compatibility issues running Pig x on Hadoop y, but this also has to do with new features of Hadoop that Pig is able to exploit in its pipeline.

I also agree with the general opinion that for Pig's adoption in Mahout land it should play well with Mahout's vector formats. At the moment I don't have the free time to look into this, but I will surely get back to evaluating the feasibility of this integration in the coming few weeks. Till then, any interested folks can file a JIRA for this and work on it.

On 2/24/10 12:27 PM, "Palleti, Pallavi" <pallavi.pall...@corp.aol.com> wrote:

I too have mixed opinions w.r.t. Pig. Pig would be a good choice for quick prototyping and testing. However, these are the pitfalls I have observed. Pig is not easy to debug. It also has performance issues: since it is a layer on top of Hadoop, there is the overhead of translating Pig into map-reduce jobs. Also, when the code is written directly in Hadoop, it is in the developer's or user's hands to improve performance through various parameters, say the number of mappers, different input formats, etc.; that is not the case with Pig. There are also compatibility issues between Pig and Hadoop: if I am using Pig version x on Hadoop version y, there may be incompatibilities, and one has to spend time resolving them, as the errors are not easy to figure out. I believe the main goal of Mahout is to provide scalable algorithms that can be used to solve real-world problems. If Pig has gotten rid of the above pitfalls, it would be a good choice, as it would greatly reduce development effort.

Thanks
Pallavi

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Monday, February 22, 2010 11:32 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Algorithm implementations in Pig

As an interesting test case, can you write a pig program that counts words, BUT takes an input file name AND an input field name?

On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> That isn't an issue here. It is the invocation of pig programs and
> passing useful information to them that is the problem.
>
>
> On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel <gan...@yahoo-inc.com> wrote:
>
>> Scripting ability, while still limited, has better streaming support, so
>> you can have relations streamed into a custom script executing in
>> either the map or reduce phase, depending upon where it is placed.
>>
>
>
> --
> Ted Dunning, CTO
> DeepDyve

--
Ted Dunning, CTO
DeepDyve
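A minimal Pig Latin sketch of the word count Ted asks for above, passing the file and field names in from the command line via parameter substitution (all names are illustrative; the LOAD schema must declare the column named by $field):

    -- run as, e.g. (paths and names illustrative):
    --   pig -param input=/data/docs.tsv -param field=body \
    --       -param output=/data/wordcounts wordcount.pig
    -- $input, $field, and $output are substituted textually before
    -- the script is parsed, so $field must match a schema column
    docs   = LOAD '$input' AS (id:chararray, title:chararray, body:chararray);
    words  = FOREACH docs GENERATE FLATTEN(TOKENIZE($field)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
    STORE counts INTO '$output';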
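And a minimal sketch of the streaming support Ankur mentions (script and relation names are illustrative): where the STREAM operator sits in the dataflow determines whether the external script runs in the map or the reduce phase.

    -- bind an alias to an external command; SHIP sends the script
    -- to the task nodes along with the job
    DEFINE normalize `normalize.py` SHIP('normalize.py');

    raw  = LOAD 'docs' AS (id:chararray, body:chararray);

    -- placed before any grouping this runs map-side; placed after
    -- a GROUP/COGROUP it would run reduce-side
    norm = STREAM raw THROUGH normalize AS (id:chararray, body:chararray);
    STORE norm INTO 'docs_normalized';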