Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by GuntherHagleitner: http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification ------------------------------------------------------------------------------ Both of the above are equivalent, but the performance will be different. - Here's what we plan to do to increase the performance: + Here's what the multi-query feature does to increase performance: - * In the second case we will add an implicit split to transform the query to case number one. That will eliminate the processing of A' multiple times. + * In the second case we add an implicit split to transform the query to case number one. That eliminates the processing of A' multiple times. - * Make the split non-blocking and allow processing to continue. This will help reduce the amount of data that has to be stored right at the split. + * Make the split non-blocking and allow processing to continue. This helps reduce the amount of data that has to be stored right at the split. * Allow multiple outputs from a job. This way we can store some results as a side-effect. This is also necessary to make the previous item work. - * Allow multiple split branches to be carried on to the combiner/reducer. This will reduce the amount of IO again in the case where multiple branches in the split can benefit from a combiner run. + * Allow multiple split branches to be carried on to the combiner/reducer. This again reduces the amount of IO in the case where multiple branches in the split can benefit from a combiner run. [[Anchor(Storing_intermediate_results)]] ==== Storing intermediate results ==== @@ -80, +80 @@ [[Anchor(Execution_in_batch_mode)]] ==== Execution in batch mode ==== - Batch mode is entered when Pig is given a script to execute. Interactive mode is on the grunt shell ("grunt:>"). Right now there isn't much difference between them. In order for us to optimize the multi-query case, we'll need to distinguish the two more. 
+ Batch mode is entered when Pig is given a script to execute (e.g.: invoking Pig with a script as a parameter, or using the "-e" or "-f" Pig options). Interactive mode is the grunt shell ("grunt:>"). Previously there wasn't much difference between them; in order to optimize the multi-query case, we needed to distinguish the two more. - Right now whenever the parser sees a store (or dump, illustrate) it will kick of the execution of that part of the script. Part of this proposal is that in batch mode, we parse the entire script first and see if we can combine things to reduce the overall amount of work that needs to be done. Only after that will the execution start. + Right now whenever the parser sees a store (or dump, illustrate) it will kick off the execution of that part of the script. Part of this change was to make batch mode parse the entire script first and see if we can combine things to reduce the overall amount of work that needs to be done. Only after that will the execution start. - The following changes are proposed (in batch): + The high-level changes are: - * Store will not trigger an immediate execution. The entire script is considered before the execution starts. + * Store does not trigger an immediate execution. The entire script is considered before the execution starts. - * Explicit splits will be put in places where a handle has multiple children. If the user wants to explicitly force re-computation of common ancestors she has to provide multiple scripts. - * Multiple split branches/stores in the script will be combined into the same job, if possible. Again, using multiple scripts is the way to go to avoid this (if that is desired). + * Explicit splits are put in places where a handle has multiple children. + * Multiple split branches/stores in the script are combined into the same job, if possible. Some problems with this: * Explain works on handles, which only gives you a slice of the entire script execution at a time. 
What's more, at the point where they occur in a script they might not give you an accurate picture of the situation, since the execution plans might change once the entire script is handled. * Debugging on the grunt shell is more complicated, since scripts run differently than what one might type on the shell. - The proposal therefore is: + Additional changes therefore are: * Add run/exec commands to the shell to execute a script in interactive or batch mode for debugging. * Add scripts as a target (in addition to handles) to explain. @@ -112, +112 @@ run [-param <key>=<value>]* [-param_file <filename>] <script name> }}} - Which will run the script in interactive mode. + Which runs the script in interactive mode, so every store triggers execution. The statements from the script are put into the command history and all the handles [[Anchor(Exec)]] ==== Exec ==== @@ -153, +153 @@ Brief: - * Does not expand nested plans + * Does not expand nested plans (presenting a smaller graph for overview) Param/Param_file: * Allows for param substitution in scripts. + + [[Anchor(Turning_off_multi_query)]] + === Turning off multi query === + + By default the multi-query optimization is enabled and script execution is handled accordingly. If it is desired + to turn off the optimization and revert to "execute-on-store" behavior, the "-M" or "-no_multiquery" switches can be used. + + In order to run the script "foo.pig" without the optimization, execute pig as follows: + + {{{ + $ pig -M foo.pig + or: + $ pig -no_multiquery foo.pig + }}} + + If pig is launched in interactive mode with this switch, "exec" statements will also run in interactive mode. + + [[Anchor(Incompatible_Changes)]] + === Incompatible Changes === + Most existing scripts produce the same result with or without the multi-query optimization. There are cases, though, + where this is not true. 
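+ + As an illustration of the combining behavior described earlier, consider a script in which one load feeds two stores (the relation and file names here are made up for illustration): + + {{{ + A = load 'input'; + B = filter A by $0 > 0; + C = filter A by $0 <= 0; + store B into 'pos'; + store C into 'neg'; + }}} + + With the optimization on, Pig inserts an implicit split on A and combines both branches, and both stores, into a single map-reduce job, so 'input' is read only once instead of once per store.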
+ + [[Anchor(Path_Names_and_Schemes)]] + ==== Path Names and Schemes ==== + Any script is parsed in its entirety before it is sent to execution. Since the current directory can change + throughout the script, any path used in a load or store is translated to a fully qualified and absolute path. + + In map-reduce mode, the following script: + + {{{ + cd /; + A = load 'foo'; + cd tmp; + store A into 'bar'; + }}} + + will load from "hdfs://<host>:<port>/foo" and store into "hdfs://<host>:<port>/tmp/bar". + + These expanded paths are passed to any LoadFunc or Slicer implementation. In some cases, especially when + a LoadFunc/Slicer is not used to read from a dfs file or path (e.g.: loading from an SQL database), this can cause + problems. + + Solutions are to either: + * Specify "-M" or "-no_multiquery" to revert to the old names + * Specify a custom scheme for the LoadFunc/Slicer + + Arguments used in a load statement that have a scheme other than "hdfs" or "file" will not be expanded and will be passed + to the LoadFunc/Slicer unchanged. + + In the SQL case: + + {{{ + A = load 'sql://footable' using SQLLoader(); + }}} + + will invoke the SQLLoader function with "sql://footable". + + [[Anchor(HBaseStorage)]] + ==== HBaseStorage ==== + Scripts using the HBaseStorage loader will trigger a warning with the multi-query optimization turned on. The reason + is the same as described above. + + Scripts like: + + {{{ + A = load 'table' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('a','b'); + }}} + + should be changed to the following to avoid the warning: + + {{{ + A = load 'hbase://table' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('a','b'); + }}} + + Using "-M" or "-no_multiquery" will also remove the warning. + + [[Anchor(Implicit_Dependencies)]] + ==== Implicit Dependencies ==== + If a script has dependencies on the execution order outside of what Pig knows about, execution might fail. + + For instance: + + {{{ + ... 
+ store A into 'foo'; + B = load 'bar'; + C = foreach B generate MYUDF($0,'foo'); + store C into 'baz'; + }}} + + MYUDF might try to read from the file foo, the file that handle A was just stored into. However, Pig does not know + that MYUDF depends on the file foo and might submit the jobs producing the files baz and foo at the same time, so + execution might fail. + + In order to make this work, the script has to be changed to: + + {{{ + ... + store A into 'foo'; + exec; + B = load 'bar'; + C = foreach B generate MYUDF($0,'foo'); + store C into 'baz'; + }}} + + The exec statement will trigger the execution of the job resulting in the file foo. This way the right execution order + is enforced. [[Anchor(Phases)]] == Phases ==