[
https://issues.apache.org/jira/browse/PIG-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595061#action_12595061
]
Shravan Matthur Narayanamurthy commented on PIG-157:
----------------------------------------------------
Thanks for the comments, Pi.
1) First concern is that using Hadoop Local will tie us to Hadoop too much.
There was an initiative quite a while ago to start looking at different
backends other than Hadoop (e.g. we might be running a backend like [EMAIL
PROTECTED] Who knows?).
However, this whole thing seems to have been built solely for Hadoop anyway.
Not sure about the current direction.
[shrav] I don't think this ties us down to Hadoop in the sense that we can't
have other backends. We just reuse some Hadoop code, that's all. The only thing I
see tied to Hadoop is that, at most, we would need to ship the Hadoop jar with
Pig, which we already do.
2) Have you tried to measure LocalHadoop startup time compared to the local
engine? If LocalHadoop takes much more time to start up, we might suffer
when processing nested queries.
[shrav] LocalHadoop has a startup time of about 6 seconds. But if we are
processing even about 10 MB of data, LocalHadoop mysteriously beats the
local engine hands down. For the local engine, I assumed that it would just
take the leaf operator, which will be a POStore, and call its store() method.
For about 12 MB of data, LocalHadoop took about 11 seconds whereas the local
engine took about 15 seconds.
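As a rough back-of-the-envelope check on those numbers, the sketch below estimates the input size at which LocalHadoop overtakes the local engine. It assumes (this is not measured) that both runtimes grow linearly with input size and that the local engine has no startup cost; the function and variable names are made up for illustration.

```python
# Figures quoted in the comment above.
STARTUP_S = 6.0        # LocalHadoop startup time, seconds
DATA_MB = 12.0         # size of the test input, MB
LOCAL_HADOOP_S = 11.0  # LocalHadoop total time for DATA_MB, seconds
LOCAL_ENGINE_S = 15.0  # local engine total time for DATA_MB, seconds

# Assumption: runtime is linear in input size, and the local engine
# starts instantly. Neither is confirmed by the measurements.
hadoop_rate = (LOCAL_HADOOP_S - STARTUP_S) / DATA_MB  # seconds per MB
engine_rate = LOCAL_ENGINE_S / DATA_MB                # seconds per MB


def local_hadoop_time(mb):
    """Estimated LocalHadoop runtime: fixed startup plus per-MB cost."""
    return STARTUP_S + hadoop_rate * mb


def local_engine_time(mb):
    """Estimated local engine runtime: per-MB cost only."""
    return engine_rate * mb


# Input size where the two estimates are equal.
break_even_mb = STARTUP_S / (engine_rate - hadoop_rate)

if __name__ == "__main__":
    print(f"break-even input size: {break_even_mb:.1f} MB")
```

Under these assumptions the break-even point is about 7.2 MB, which is consistent with LocalHadoop already winning at around 10 MB. For very small inputs, the 6 second startup would dominate.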
As far as the nested plan in foreach goes, at least currently, we won't be
creating an instance of a local engine to run the nested plan. Currently, all
operators that can be used inside the nested plan have been implemented such
that the generic plan execution model with attachInputs called on the inner
plan will work fine. However, if we decide to allow all operators inside the
nested plan, then we will have to make changes to the MRCompiler so that the
nested foreach becomes a blocking operator and is handled separately by
spawning new MR jobs to process the plan inside. In this case, invoking
LocalHadoop would probably not make sense. The executable operator plan is a
better option here, as it would also mean that no changes are needed
to the MRCompiler.
So, at least now, LocalJobRunner will not be invoked inside the MapReduce
execution for executing nested plans. The LocalJobRunner will be strictly used
only when the user is in local execution mode.
I will update the wiki with these comments.
Thanks for the input, Pi. I had not thought about the nested foreach once it
becomes full-blown.
> Add types and rework execution pipeline
> ---------------------------------------
>
> Key: PIG-157
> URL: https://issues.apache.org/jira/browse/PIG-157
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: Core.patch.zip, exceptions.patch, incr1.zip
>
>
> This is the tracking bug for the work to add types to pig and rework the
> execution pipeline. Individual components of this work are covered in
> subtasks.
> Functional and design specs for this work are:
> http://wiki.apache.org/pig/PigTypesFunctionalSpec
> http://wiki.apache.org/pig/PigTypesDesign
> http://wiki.apache.org/pig/PigExecutionModel
> This work is being done on the branch types, since it is large and
> disruptive, and we want to be able to do incremental checkins without causing
> issues for the trunk.