Apache Wiki
Wed, 07 May 2008 15:39:15 -0700
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by Shravan Narayanamurthy:
http://wiki.apache.org/pig/LocalJobRunner

------------------------------------------------------------------------------
  * These have been fixed however in hadoop-16
  * Not sure how this will affect Example generator
+ Pi Song had some interesting observations:
+ 1) Will the LocalJobRunner be invoked when processing the nested plan inside foreach?
+ Currently, no. We have local versions of the operators that are allowed inside the nested plan, and these can be used for running tuples through the plan. However, if we later intend to support a full blown foreach with arbitrary nesting and all operators supported, we can take two approaches:
+ i. Have local versions of all operators and just use the current model to run tuples through. This also means that we would not have to change anything in the MRCompiler.
+ ii. Change MRCompiler to process nested foreach as a blocking operator and recursively process it, creating a list of dependent jobs. In this case, it would probably make more sense to run the nested plan in MapReduce itself and not locally. However, this can be a choice: the MapReduce Launcher can decide to execute these plans either locally by invoking the LocalJobRunner or through the Hadoop Job Tracker, based on the input size for the plans.
+
+ 2) Will the invocation of LocalJobRunner have some latency?
+ Definitely it does. As measured in hadoop 15, it has about 5 sec of startup latency. Whether this matters depends on how and where we are using LocalJobRunner. If we strictly use it only when the user asks for local execution mode, it should not matter. Also, if the size of the data is at least in the 10s of MBs, the LocalJobRunner performs better than streaming tuples through the plan of local operators.
+
+ I guess the choice is harder now :)
+ The choice now depends on what we want to do for the full blown foreach. Since I would like to implement choice (ii), I would vote for using LocalJobRunner.
+
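
The size-based decision mentioned under choice (ii) could look roughly like the sketch below. It is only an illustration: the class name NestedPlanLauncher, the Target enum, and the 64 MB limit are all hypothetical and are not part of Pig or Hadoop, and the code does not call any Hadoop API. The page does not say which side of the size limit should go to which runner; the sketch assumes small inputs stay local and larger ones go to the Job Tracker.

```java
// Hypothetical sketch of a launcher choosing where to run a nested plan.
// None of these names exist in Pig or Hadoop; the limit is an assumption.
final class NestedPlanLauncher {

    /** Where a nested plan's job would be submitted. */
    enum Target { LOCAL_JOB_RUNNER, JOB_TRACKER }

    // Inputs at or below this size are run locally (assumed policy).
    private final long localLimitBytes;

    NestedPlanLauncher(long localLimitBytes) {
        if (localLimitBytes < 0) {
            throw new IllegalArgumentException("localLimitBytes must be >= 0");
        }
        this.localLimitBytes = localLimitBytes;
    }

    /** Picks an execution target from the plan's total input size. */
    Target choose(long inputSizeBytes) {
        if (inputSizeBytes < 0) {
            throw new IllegalArgumentException("inputSizeBytes must be >= 0");
        }
        return inputSizeBytes <= localLimitBytes
                ? Target.LOCAL_JOB_RUNNER
                : Target.JOB_TRACKER;
    }

    public static void main(String[] args) {
        // 64 MB is an arbitrary illustrative limit, not a measured value.
        NestedPlanLauncher launcher = new NestedPlanLauncher(64L * 1024 * 1024);
        System.out.println(launcher.choose(10L * 1024 * 1024));   // small input
        System.out.println(launcher.choose(1024L * 1024 * 1024)); // large input
    }
}
```

A real launcher would also have to weigh the roughly 5 sec LocalJobRunner startup latency noted above, so the limit would need to be tuned by measurement rather than fixed in code.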