There is a switch to turn off all optimizations. I don't recall what it is off the top of my head, but you can easily find out by running pig -h. Does that help?
On Mar 12, 2012, at 2:58 PM, Yongzhi Wang <[email protected]> wrote:

> I found that in the physical plan there are 3 split operators generated.
> I think that's the reason for the 3 map tasks.
>
> Would it be possible for Pig to provide, in the future, a parameter or
> syntax to control whether optimization is applied? Sometimes a one-to-one
> translation from the Pig script to the MapReduce plan is useful, since
> the user can then control the executed MapReduce plan by manipulating
> the script.
>
> Thanks!
>
> On Fri, Mar 9, 2012 at 2:02 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> Pig always runs the same piece of code, generically speaking. There is
>> no codegen. What actually happens is driven by serialized DAGs of the
>> physical operators.
>> This does seem like a bug (or an inefficiency, at least); we shouldn't
>> need 3 mappers to do 3 filters of the same data and re-group.
>>
>> D
>>
>> On Thu, Mar 8, 2012 at 7:51 PM, Yongzhi Wang <[email protected]>
>> wrote:
>>> So should the two map tasks be running the same piece of code, but
>>> reading different input? Or are the two tasks actually running
>>> different code?
>>> Is there any way I can track the real map and reduce functions that
>>> Pig ships to the workers?
>>> Or can you tell me which piece of source code in the Pig project
>>> generates the map and reduce tasks sent to the slave workers?
>>>
>>> Thanks!
>>>
>>> On Thu, Mar 8, 2012 at 8:23 PM, Dmitriy Ryaboy <[email protected]>
>>> wrote:
>>>
>>>> That's what I get for reading explain plans on an iPhone. Sorry.
>>>>
>>>> So, yeah, the cogrouping is happening as part of the shuffle.
>>>> It seems like Pig is creating a task per t1 and t2 (and then a
>>>> logical union of the two, which is just there to indicate that tuples
>>>> from both relations go into the same meta-relation, tagged with their
>>>> source, which will then get cogrouped). It shouldn't; it should be
>>>> able to reuse the same scan of the source data for both t1 and t2.
>>>>
>>>> D
>>>>
>>>> On Thu, Mar 8, 2012 at 9:13 AM, Yongzhi Wang <[email protected]>
>>>> wrote:
>>>>> Thanks, Dmitriy. I understand that there is only one job, containing
>>>>> 2 map tasks and 1 reduce task. But the problem is that even though I
>>>>> only have one input file with a size of 1.4 KB (fewer than 50 rows
>>>>> of records), the stats data still shows that it needs 2 map tasks.
>>>>>
>>>>> The union operation is shown at the top of the map plan tree:
>>>>> (Union[tuple] - scope-85)
>>>>>
>>>>> #--------------------------------------------------
>>>>> # Map Reduce Plan
>>>>> #--------------------------------------------------
>>>>> MapReduce node scope-84
>>>>> Map Plan
>>>>> Union[tuple] - scope-85
>>>>> |
>>>>> |---result: Local Rearrange[tuple]{bytearray}(false) - scope-73
>>>>> |   |   |
>>>>> |   |   Project[bytearray][1] - scope-74
>>>>> |   |
>>>>> |   |---part1: Filter[bag] - scope-59
>>>>> |   |   |
>>>>> |   |   Greater Than[boolean] - scope-63
>>>>> |   |   |
>>>>> |   |   |---Cast[int] - scope-61
>>>>> |   |   |   |
>>>>> |   |   |   |---Project[bytearray][1] - scope-60
>>>>> |   |   |
>>>>> |   |   |---Constant(11) - scope-62
>>>>> |   |
>>>>> |   |---my_raw: New For Each(false,false,false)[bag] - scope-89
>>>>> |   |   |
>>>>> |   |   Project[bytearray][0] - scope-86
>>>>> |   |   |
>>>>> |   |   Project[bytearray][1] - scope-87
>>>>> |   |   |
>>>>> |   |   Project[bytearray][2] - scope-88
>>>>> |   |
>>>>> |   |---my_raw:
>>>>> Load(hdfs://master:54310/user/root/houred-small:PigStorage(' ')) -
>>>>> scope-90
>>>>> |
>>>>> |---result: Local Rearrange[tuple]{bytearray}(false) - scope-75
>>>>>     |   |
>>>>>     |   Project[bytearray][1] - scope-76
>>>>>     |
>>>>>     |---part2: Filter[bag] - scope-66
>>>>>     |   |
>>>>>     |   Less Than[boolean] - scope-70
>>>>>     |   |
>>>>>     |   |---Cast[int] - scope-68
>>>>>     |   |   |
>>>>>     |   |   |---Project[bytearray][1] - scope-67
>>>>>     |   |
>>>>>     |   |---Constant(13) - scope-69
>>>>>     |
>>>>>     |---my_raw: New For Each(false,false,false)[bag] - scope-94
>>>>>     |   |
>>>>>     |   Project[bytearray][0] - scope-91
>>>>>     |   |
>>>>>     |   Project[bytearray][1] - scope-92
>>>>>     |   |
>>>>>     |   Project[bytearray][2] - scope-93
>>>>>     |
>>>>>     |---my_raw:
>>>>> Load(hdfs://master:54310/user/root/houred-small:PigStorage(' ')) -
>>>>> scope-95
>>>>> --------
>>>>> Reduce Plan
>>>>> result: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-77
>>>>> |
>>>>> |---result: Package[tuple]{bytearray} - scope-72
>>>>> --------
>>>>> Global sort: false
>>>>>
>>>>>
>>>>> On Thu, Mar 8, 2012 at 1:14 AM, Dmitriy Ryaboy <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> You are confusing map and reduce tasks with MapReduce jobs. Your
>>>>>> pig script resulted in a single MapReduce job. The number of map
>>>>>> tasks was 2, based on input size -- it has little to do with the
>>>>>> actual operators you used.
>>>>>>
>>>>>> There is no union operator involved, so I am not sure what you are
>>>>>> referring to with that.
>>>>>>
>>>>>> On Mar 7, 2012, at 8:09 AM, Yongzhi Wang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I tried to use the "explain" syntax, but the MapReduce plan
>>>>>>> sometimes confuses me.
>>>>>>>
>>>>>>> I tried the script below:
>>>>>>>
>>>>>>> my_raw = LOAD './houred-small' USING PigStorage('\t') AS (user,
>>>>>>> hour, query);
>>>>>>> part1 = filter my_raw by hour > 11;
>>>>>>> part2 = filter my_raw by hour < 13;
>>>>>>> result = cogroup part1 by hour, part2 by hour;
>>>>>>> dump result;
>>>>>>> explain result;
>>>>>>>
>>>>>>> The job stats are shown below, indicating there are 2 map tasks
>>>>>>> and 1 reduce task. But I don't know how the map tasks correspond
>>>>>>> to the MapReduce plan shown below. It seems each map task just
>>>>>>> does one filter and rearrange, but in which phase is the union
>>>>>>> operation done? The shuffle phase? If that is the case, the two
>>>>>>> map tasks actually do different filter work. Is that possible? Or
>>>>>>> is my guess wrong?
>>>>>>>
>>>>>>> So, back to the question: Is there any way that I can see the
>>>>>>> actual map and reduce tasks executed by Pig?
>>>>>>>
>>>>>>> Job Stats (time in seconds):
>>>>>>> JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime
>>>>>>> MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature
>>>>>>> Outputs
>>>>>>> job_201203021230_0038  2  1  3  3  3  12  12  1  2
>>>>>>> my_raw,part1,part2,result  COGROUP
>>>>>>> hdfs://master:54310/tmp/temp626037557/tmp-1661404166,
>>>>>>>
>>>>>>> The MapReduce plan is shown below:
>>>>>>>
>>>>>>> <snip: same MapReduce plan as quoted earlier in this thread>
>>>>>>>
>>>>>>> Thanks!
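For readers following the archive: the semantics of the script under
discussion can be sketched in plain Python. This is a toy model of what
COGROUP produces, not Pig's implementation, and the sample rows are made
up for illustration:

```python
from collections import defaultdict

# Made-up sample rows in the (user, hour, query) schema of houred-small.
my_raw = [
    ("alice", 10, "q1"),
    ("bob",   12, "q2"),
    ("carol", 12, "q3"),
    ("dave",  14, "q4"),
]

# part1 = filter my_raw by hour > 11;  part2 = filter my_raw by hour < 13;
part1 = [t for t in my_raw if t[1] > 11]
part2 = [t for t in my_raw if t[1] < 13]

# result = cogroup part1 by hour, part2 by hour;
# Each group key maps to one bag per input relation.
groups = defaultdict(lambda: ([], []))
for t in part1:
    groups[t[1]][0].append(t)
for t in part2:
    groups[t[1]][1].append(t)

for hour, (bag1, bag2) in sorted(groups.items()):
    print(hour, bag1, bag2)
```

Note that a row like bob's (hour 12) satisfies both filters and so shows
up in both bags of its group, which is exactly why the plan tags each
tuple with its source relation before the shuffle.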

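Dmitriy's point that one scan should be able to feed both filters can
also be sketched. In the toy model below, the map phase reads each tuple
once, routes it through both filter branches, and emits it tagged with a
relation index (roughly the role of Local Rearrange, with the plan's
Union merging the branch outputs); the reduce phase collects tuples per
key into per-relation bags (roughly the Package operator). A sketch of
the idea only, not Pig code:

```python
from collections import defaultdict

# Made-up sample rows in the (user, hour, query) schema.
rows = [("alice", 10, "q1"), ("bob", 12, "q2"), ("dave", 14, "q4")]

def map_phase(tuples):
    # One scan of the input evaluates every filter branch and emits
    # (key, (relation_index, tuple)) pairs.
    for t in tuples:
        if t[1] > 11:          # part1 branch: hour > 11
            yield (t[1], (0, t))
        if t[1] < 13:          # part2 branch: hour < 13
            yield (t[1], (1, t))

# Shuffle + reduce phase: package tuples per key into one bag per
# input relation.
packaged = defaultdict(lambda: ([], []))
for key, (idx, t) in map_phase(rows):
    packaged[key][idx].append(t)

for key in sorted(packaged):
    print(key, packaged[key])
```

The end result matches the cogroup semantics while touching the input
only once, which is the inefficiency being pointed out: the posted plan
instead contains a separate Load (and hence a separate scan) for each
filter branch.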