[jira] Commented: (PIG-157) Add types and rework execution pipeline

Shravan Matthur Narayanamurthy (JIRA) Wed, 07 May 2008 15:16:19 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595061#action_12595061
 ]


Shravan Matthur Narayanamurthy commented on PIG-157:
----------------------------------------------------

Thanks for the comments, Pi.

1) First concern is that using Hadoop Local will tie us to Hadoop too much.
There was an initiative quite a while ago to start looking at different 
backends other than Hadoop (e.g. we might be running a backend like [EMAIL 
PROTECTED] Who knows?).

However, this whole thing seems to have been built solely for Hadoop anyway. 
Not sure about the current direction.
[shrav] I don't think this ties us to Hadoop in the sense that we can't have 
other backends. We just reuse some Hadoop code, that's all. The only Hadoop 
dependency I see is that, at most, we would need to ship the Hadoop jar with 
Pig, which we already do.

2) Have you tried to measure LocalHadoop startup time compared to the local 
engine? If LocalHadoop takes much more time to start up, we might suffer 
when processing nested queries.
[shrav] LocalHadoop has a startup time of about 6 sec. But even when we are 
processing as little as 10 MB of data, LocalHadoop mysteriously beats the 
local engine hands down. For the local engine, I presumed that it would just 
take the leaf operator, which will be a POStore, and call its store() method.
For about 12 MB of data, LocalHadoop took about 11 sec, whereas the local 
engine took about 15 sec.
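
As a rough sanity check on these numbers, the sketch below estimates the input 
size at which the two engines would take the same time. It assumes a purely 
linear cost model (fixed startup plus constant per-MB cost), which the 
measurements above do not establish; the class and method names are 
hypothetical and not part of Pig.

```java
public class BreakEven {
    /**
     * Input size (MB) at which both engines take equal time, ASSUMING each
     * engine's run time is startup + size * constant per-MB cost.
     * The local engine is assumed to have no startup cost.
     */
    static double breakEvenMb(double startupSec, double hadoopTotalSec,
                              double localTotalSec, double sizeMb) {
        // Per-MB cost of LocalHadoop once its fixed startup is subtracted.
        double hadoopSecPerMb = (hadoopTotalSec - startupSec) / sizeMb;
        // Per-MB cost of the local engine.
        double localSecPerMb = localTotalSec / sizeMb;
        // Solve startup + x * hadoopSecPerMb == x * localSecPerMb for x.
        return startupSec / (localSecPerMb - hadoopSecPerMb);
    }

    public static void main(String[] args) {
        // Figures quoted above: 6 sec startup, 11 sec vs 15 sec on 12 MB.
        System.out.printf("break-even at about %.1f MB%n",
                breakEvenMb(6.0, 11.0, 15.0, 12.0));
    }
}
```

Under that assumption the break-even lands at roughly 7 MB, which is 
consistent with LocalHadoop already winning at around 10 MB.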

As far as the nested plan in foreach goes, at least currently, we won't be 
creating an instance of a local engine to run the nested plan. Currently, all 
operators that can be used inside the nested plan have been implemented such 
that the generic plan execution model, with attachInputs called on the inner 
plan, will work fine. However, if we decide to allow all operators inside the 
nested plan, then we will have to change the MRCompiler so that the nested 
foreach becomes a blocking operator, handled separately by spawning new MR 
jobs to process the plan inside. In this case, invoking LocalHadoop would 
probably not make sense. The executable operator plan is a better option here, 
as it would also mean that no changes to the MRCompiler are needed.
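
To illustrate the execution model described above, here is a minimal sketch of 
a foreach that attaches each outer tuple to an inner plan and pulls the result 
back out, with no separate engine or MR job involved. All types here 
(Operator, InnerPlan, forEach) are simplified, hypothetical stand-ins and not 
Pig's actual classes or signatures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class NestedPlanSketch {
    /** Hypothetical stand-in for a non-blocking operator usable in the inner plan. */
    interface Operator {
        List<Integer> process(List<Integer> input);
    }

    /** Hypothetical inner plan: a linear chain of operators fed one input at a time. */
    static class InnerPlan {
        private final List<Operator> operators = new ArrayList<>();
        private List<Integer> attached;

        InnerPlan add(Operator op) {
            operators.add(op);
            return this;
        }

        /** Hands the current outer tuple to the plan. */
        void attachInput(List<Integer> tuple) {
            attached = tuple;
        }

        /** Runs the attached input through every operator, in process. */
        List<Integer> getNext() {
            if (attached == null) {
                throw new IllegalStateException("no input attached");
            }
            List<Integer> current = attached;
            for (Operator op : operators) {
                current = op.process(current);
            }
            attached = null; // each attached input is consumed exactly once
            return current;
        }
    }

    /** The foreach: attach each outer tuple, then pull the inner plan's result. */
    static List<List<Integer>> forEach(List<List<Integer>> input, InnerPlan plan) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> tuple : input) {
            plan.attachInput(tuple);
            out.add(plan.getNext());
        }
        return out;
    }

    public static void main(String[] args) {
        // Inner plan with a single filter that keeps even values.
        InnerPlan plan = new InnerPlan().add(
                in -> in.stream().filter(x -> x % 2 == 0).collect(Collectors.toList()));
        System.out.println(forEach(List.of(List.of(1, 2, 3, 4), List.of(5, 6)), plan));
    }
}
```

Because every operator in this sketch works on one attached input at a time, 
the foreach never has to block; an operator that needed all inputs before 
producing output would not fit this shape, which is the case discussed above.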

So, at least for now, LocalJobRunner will not be invoked inside the MapReduce 
execution for executing nested plans. LocalJobRunner will be used only when 
the user is in local execution mode.

I will update the wiki with these comments.
Thanks for the input, Pi. I had not thought about the nested foreach when it 
grows full-blown.

> Add types and rework execution pipeline
> ---------------------------------------
>
>                 Key: PIG-157
>                 URL: https://issues.apache.org/jira/browse/PIG-157
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: Core.patch.zip, exceptions.patch, incr1.zip
>
>
> This is the tracking bug for the work to add types to pig and rework the 
> execution pipeline.  Individual components of this work are covered in 
> subtasks.
> Functional and design specs for this work are:
> http://wiki.apache.org/pig/PigTypesFunctionalSpec
> http://wiki.apache.org/pig/PigTypesDesign
> http://wiki.apache.org/pig/PigExecutionModel
> This work is being done on the branch types, since it is large and 
> disruptive, and we want to be able to do incremental checkins without causing 
> issues for the trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
