Thanks for the clarification, Ankur. Do you have any performance
comparison between Pig 0.6 and plain Hadoop M/R? I would be interested
in looking at it. The last comparison I heard about was in
http://osdir.com/ml/hive-user-hadoop-apache/2009-06/msg00078.html.
Pig 0.7.0 seems interesting. Thanks for sharing the information. I am
looking forward to experimenting with it.

Thanks
Pallavi 

-----Original Message-----
From: Ankur C. Goel [mailto:gan...@yahoo-inc.com] 
Sent: Wednesday, February 24, 2010 1:24 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Algorithm implementations in Pig

Pallavi,
      Thanks for your comments. Some clarifications w.r.t. Pig.

Pig does not generate any M/R code. What it generates are logical,
physical, and map-reduce plans, which are nothing but DAGs. The
map-reduce plan is then interpreted by Pig's own mappers/reducers. The
plan generation itself is done on the client side and takes a few
seconds, or minutes if you have a really big script.
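
As a quick illustration (the alias and path below are made up), the
EXPLAIN command in the Grunt shell prints exactly those plans without
running anything:

a = LOAD 'input.txt' AS (line:chararray);
b = GROUP a BY line;
c = FOREACH b GENERATE group, COUNT(a);
EXPLAIN c;   -- dumps the logical, physical and map-reduce plans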

As for performance tuning in Hadoop, all the M/R parameters can be
adjusted in Pig to have the same effect they'd have in Java M/R
programs. Pig 0.7 is moving towards using Hadoop's input/output formats
in its load/store functions, so your custom I/O formats can be reused
with little additional effort.
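
For instance (a hedged sketch, with made-up alias names), the number of
reduce tasks, one of the knobs you would normally set on a Java M/R
job, can be set per operator with the PARALLEL clause:

logs    = LOAD 'access_logs' AS (userid:chararray, url:chararray);
grouped = GROUP logs BY userid PARALLEL 40;   -- 40 reduce tasks for this step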

Pig also provides very nice features like MultiQuery optimization and
skewed and merge joins that are hard to implement in Java M/R every
time you need them.
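
A rough sketch of what that looks like in a script (table and field
names are hypothetical); switching join strategies is a one-word
change:

big   = LOAD 'big_table'   AS (key:chararray, v1:chararray);
small = LOAD 'small_table' AS (key:chararray, v2:chararray);
j1 = JOIN big BY key, small BY key USING 'skewed';  -- handles badly skewed keys
j2 = JOIN big BY key, small BY key USING 'merge';   -- map-side join over pre-sorted inputs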

With the latest Pig release (0.6), the performance gap between Java M/R
and Pig has been narrowed to a good extent.

Simple statistical measures that you would use to understand or
preprocess your data are very easy to compute with just a few lines of
Pig code, and a lot of utility UDFs are available for that.
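
For example, a hedged sketch of per-group summary statistics using only
built-in UDFs (the input path and field names are made up):

data   = LOAD 'ratings' AS (user:chararray, item:chararray, rating:double);
byitem = GROUP data BY item;
stats  = FOREACH byitem GENERATE group AS item,
                                 COUNT(data)      AS n,
                                 AVG(data.rating) AS mean,
                                 MIN(data.rating) AS low,
                                 MAX(data.rating) AS high;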

Besides all the good things, I agree that there are compatibility
issues running Pig version x on Hadoop version y, but this also has to
do with new features of Hadoop that Pig is able to exploit in its
pipeline.

I also agree with the general opinion that for Pig's adoption in Mahout
land, it should play well with Mahout's vector formats.

At the moment I don't have enough free time to look into this, but I
will surely get back to evaluating the feasibility of this integration
in the coming few weeks. Till then, any interested folks can open a
JIRA for this and work on it.


On 2/24/10 12:27 PM, "Palleti, Pallavi" <pallavi.pall...@corp.aol.com>
wrote:

I too have mixed opinions w.r.t. Pig. Pig would be a good choice for
quick prototyping and testing. However, the following are the pitfalls
I have observed with Pig.

It is not easy to debug in Pig. It also has performance issues: since
it is a layer on top of Hadoop, there is the overhead of converting Pig
into map-reduce code. Also, when the code is written directly in
Hadoop, it is in the developer's/user's hands to improve performance by
tuning various parameters, say, the number of mappers, different input
formats, etc. This is not the case with Pig. There are also some
compatibility issues between Pig and Hadoop: if I am using Pig version
x on Hadoop version y, there might be compatibility issues, and one
needs to spend time resolving them, as it is not easy to figure out the
errors.
I believe the main goal of Mahout is to provide scalable algorithms
which can be used to solve real-world problems. In that case, if Pig
has gotten rid of the above pitfalls, then it would be a good choice,
as it would greatly reduce development time and effort.

Thanks
Pallavi

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Monday, February 22, 2010 11:32 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Algorithm implementations in Pig

As an interesting test case, can you write a Pig program that counts
words?

BUT, it takes an input file name AND an input field name.
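
A hedged sketch of how this might look with Pig's parameter
substitution (the parameter names, delimiter, and schema below are
assumptions, and the field still has to appear in a schema declared
inside the script, which is arguably part of the problem of passing
useful information to Pig programs):

-- hypothetical invocation:
--   pig -param INPUT=docs.tsv -param FIELD=text -param OUTPUT=counts wordcount.pig
docs   = LOAD '$INPUT' USING PigStorage('\t')
         AS (id:chararray, text:chararray);
words  = FOREACH docs GENERATE FLATTEN(TOKENIZE($FIELD)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '$OUTPUT';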

On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning <ted.dunn...@gmail.com>
wrote:

>
> That isn't an issue here.  It is the invocation of pig programs and 
> passing useful information to them that is the problem.
>
>
> On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel
<gan...@yahoo-inc.com>wrote:
>
>> Scripting ability, while still limited, has better streaming support,
>> so you can have relations streamed into a custom script executing in
>> either the map or reduce phase depending upon where it is placed.
>>
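
A hedged sketch of the streaming described above (the script name and
schema are hypothetical); STREAM runs the relation through an external
command:

DEFINE tokenize `tokenize.py` SHIP('tokenize.py');
raw    = LOAD 'docs' AS (id:chararray, text:chararray);
tokens = STREAM raw THROUGH tokenize;  -- runs in the map or reduce phase depending on placement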
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
>


--
Ted Dunning, CTO
DeepDyve
