Pallavi,
      Thanks for your comments. Some clarifications w.r.t. Pig.

Pig does not generate any M/R code. What it generates is logical, physical,
and map-reduce plans, which are nothing but DAGs. The map-reduce plan is then
interpreted by Pig's own mappers/reducers. The plan generation itself is done
on the client side and takes a few seconds or minutes (if you have a really
big script).
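
If anyone wants to see these plans, Pig's EXPLAIN statement prints all three
of them without running the job. A minimal sketch (file name and schema are
made up for illustration):

```pig
-- EXPLAIN dumps the logical, physical and map-reduce plans for an alias
raw  = LOAD 'input.txt' AS (line:chararray);
grpd = GROUP raw ALL;
cnt  = FOREACH grpd GENERATE COUNT(raw);
EXPLAIN cnt;
```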

As for performance tuning in Hadoop, all the M/R parameters can be adjusted in
Pig to have the same effect they'd have in Java M/R programs. Pig 0.7 is moving
towards using Hadoop's InputFormat/OutputFormat in its load/store functions, so
your custom I/O formats can be reused with little additional effort.
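
As a small sketch of the tuning knobs (assuming a Hadoop 0.20-era setup):
reducer parallelism can be requested per operator with the PARALLEL clause,
and other M/R properties can be passed on the pig command line, e.g.
`pig -Dmapred.child.java.opts=-Xmx1g script.pig`:

```pig
-- hypothetical input; PARALLEL requests 20 reducers for this GROUP
data = LOAD 'input' AS (k:chararray, v:int);
grpd = GROUP data BY k PARALLEL 20;
cnts = FOREACH grpd GENERATE group, COUNT(data);
```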

Pig also provides very nice features like multi-query optimization and skewed
and merge joins that are hard to implement in Java M/R every time you need them.
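
For reference, the specialized joins are a one-word change in the script. A
hedged sketch (relation names and schemas are invented; merge join assumes
both inputs are already sorted on the join key):

```pig
users  = LOAD 'users'  AS (id:int, name:chararray);
clicks = LOAD 'clicks' AS (id:int, url:chararray);
-- 'skewed' splits heavily loaded keys across reducers
j1 = JOIN clicks BY id, users BY id USING 'skewed';
-- 'merge' does a map-side sort-merge join on pre-sorted inputs
j2 = JOIN clicks BY id, users BY id USING 'merge';
```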

With the latest Pig release, 0.6, the performance gap between Java M/R and Pig
has been narrowed considerably.

Simple statistical measures that you would use to understand or preprocess
your data take just a few lines of Pig, and a lot of utility UDFs are
available for that.
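
As an example of the "few lines" claim, here is a sketch that computes a
per-item mean using the built-in AVG function (file name and schema are
hypothetical):

```pig
data = LOAD 'ratings.tsv' AS (user:int, item:int, rating:double);
grpd = GROUP data BY item;
-- one mean per group key
avgs = FOREACH grpd GENERATE group AS item, AVG(data.rating);
STORE avgs INTO 'item_means';
```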

Besides all the good things, I agree that there are compatibility issues
running Pig-x on Hadoop-y, but this also has to do with new features of Hadoop
that Pig is able to exploit in its pipeline.

I also agree with the general opinion that for Pig's adoption in Mahout land
it should play well with Mahout's vector formats.

At the moment I don't have the free time to look into this, but I will surely
get back to evaluating the feasibility of this integration in the coming
weeks. Until then, any interested folks can file a JIRA for this and
work on it.


On 2/24/10 12:27 PM, "Palleti, Pallavi" <pallavi.pall...@corp.aol.com> wrote:

I too have a mixed opinion w.r.t. Pig. Pig would be a good choice for
quickly prototyping and testing. However, the following are the pitfalls I
have observed in Pig.

It is not easy to debug in Pig. Also, it has performance issues: as it is a
layer on top of Hadoop, there is the overhead of converting Pig into
map-reduce code. Also, when the code is written directly in Hadoop, it is in
the developer's/user's hands to improve performance using various parameters,
say, the number of mappers, different input formats, etc. That is not the
case with Pig. Also, there are some compatibility issues between Pig and
Hadoop. Say, if I am using Pig version x on Hadoop version y, there might be
compatibility issues, and one needs to spend time resolving them, as it is
not easy to figure out the errors.
I believe the main goal of Mahout is to propose scalable algorithms
which can be used to solve real-world problems. In that case, if Pig has
gotten rid of the above pitfalls, then it would be a good choice, as the
development effort would be much lower.

Thanks
Pallavi

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Monday, February 22, 2010 11:32 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Algorithm implementations in Pig

As an interesting test case, can you write a Pig program that counts
words?

BUT, it takes an input file name AND an input field name.

On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning <ted.dunn...@gmail.com>
wrote:

>
> That isn't an issue here.  It is the invocation of pig programs and
> passing useful information to them that is the problem.
>
>
> On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel
<gan...@yahoo-inc.com>wrote:
>
>> Scripting ability, while still limited, has better streaming support, so
>> you can have relations streamed into a custom script executing in
>> either the map or reduce phase depending upon where it is placed.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
>


--
Ted Dunning, CTO
DeepDyve
