The following page has been changed by GuntherHagleitner:
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

------------------------------------------------------------------------------
  == Internal Changes ==

  [[Anchor(Grunt_parser_(Phase_1))]]
- ==== Grunt parser (Phase 1) ====
+ ==== GruntParser/PigServer (Phase 1) ====

- The parser currently uses a bottom up approach. When it sees a store (dump, explain), it goes bottom up and generates the plan that needs to happen for this particular store. In order to optimize the multi-query example, we need, however, a peek on the entire graph for a script (interactive mode can be handled differently).
+ The parser currently uses a bottom up approach. When it sees a store (dump, explain), it goes bottom up and generates the plan that needs to happen for this particular store. In order to optimize the multi-query example, we need, however, a peek on the entire graph for a script (interactive mode is handled differently).

- In order to do this we will change the batch mode of the parser to:
+ The high-level changes to the parser/server are:
   * Not execute the plan when we see a store or dump
   * Alter the already existing merge functionality to allow intersecting graphs to be joined into a single logical plan.

@@ -520, +520 @@

  The new "run" command will simply feed all the lines of a script through the interactive mode.

+ The PigServer has a new interface:
+  * setBatchOn()
+     * By default batch mode is off and we are in interactive mode. setBatchOn starts a new batch for execution that will not execute on store.
+     * setBatchOn can be called multiple times and will produce a nested set of batches
+  * executeBatch()
+     * Whenever batch mode is on, execute will process all the stores that are currently in the batch and have not been processed before
+  * discardBatch()
+     * Removes the current batch and goes back to the previous batch (or interactive mode, if there are no more)
+  * isBatchOn()
+     * Tells whether batch mode is on
+  * isBatchEmpty()
+     * Helper function that tells whether there are any unprocessed stores in the current batch
+ 
+ Internally all different batches and the interactive mode are represented as Graph objects. A graph object maintains a cache of the registered queries, keeps track of the logical plan and the processed/unprocessed stores. Graphs themselves can be interactive or batch. Interactive graphs will execute on store, batch graphs won't.
+ 
+ The PigServer maintains a stack of these graph objects so that setBatchOn/discardBatch operations basically become push and pop operations.
+ 
+ If the multi-query optimization is turned off, all graphs will be generated as interactive, which is how we revert the behavior.
+ 
+ The merging of the different logical plans is done in the OperatorPlan, where merges can now either check for disjoint graphs or merge them with overlaps.
+ 
+ Finally, the store-load handling is done in the PigServer. It will either transform the plan or add a store-load connection. Absolute filenames will be available, since the QueryParser now translates them when it sees them.
+ 
+ The grunt parser makes use of the new PigServer APIs. It will use the batch API to parse and execute scripts. Since scripts can be nested, the script execution and the stack of graphs in the PigServer are closely related.
+ 
  [[Anchor(Explain_(Phase_1))]]
  ==== Explain (Phase 1) ====
  As described above the changes are:
-  * Add options to explain and illustrate to work on a script file as well as a handle.
+  * Add options to explain to work on a script file as well as a handle.
-  * Add the ability to print plans as dot files and to write explain and illustrate output to files.
+  * Add the ability to print plans as dot files and to write output to files.
+  * Allow for a "brief" option to control verbosity.

- There will be some work to nicely represent the graphs resulting from explain in text form. Right now operators with multiple outputs will result in the ancestor tree be duplicated for each output. It might be nicer to show the ancestors once and mark the other places as copies of that one.
+ The explain command already supports writing to PrintStreams, so adding the capability of writing to files was fairly straightforward.
+ 
+ Parameters for the desired output format as well as the verbosity were added to the explain calls in the execution engine, logical and physical plan.
+ 
+ The dot output is realized by the DotPlanDumper. Since dot does not care about any specific order, we simply iterate over the nodes and edges and dump them. Nested plans are realized as subgraphs, which will cause a recursion in the DotPlanDumper. Printing a map reduce plan will first start with the MROperator plan and then recurse into the physical plans.
+ 
+ Invisible nodes and edges are used to force the right layout of subgraphs. There is one invisible input and one invisible output per subgraph that are connected (via invisible edges) to the roots and leaves of the nested plan respectively. This way, top-to-bottom layout of the nested graph is ensured.
+ 
+ CoGroups are special in that their nested plans are connected to input operators. This is modeled as subgraphs as well and the subgraphs are connected to the respective input operator.

  [[Anchor(Local_Execution_engine_(Phase_1))]]
  ==== Local Execution engine (Phase 1) ====