Yes, I agree: not introducing new syntax is much preferable. Doing this optimization automatically in batch mode is a good idea. For interactive mode, we would need something like a COMMIT statement to force execution (rather than execution starting automatically on a STORE command, as it does today).
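For concreteness, an interactive session under this proposal might look like the sketch below. COMMIT is hypothetical syntax (it does not exist in Pig today) and the file names are made up; the idea is that STORE only queues work, and COMMIT compiles all queued stores into one combined plan over a single scan of A:

  A = LOAD 'input' AS (user, url);
  B = FILTER A BY user IS NOT NULL;
  C = GROUP B BY user;
  STORE B INTO 'filtered_out';  -- queued, not executed yet
  STORE C INTO 'grouped_out';   -- queued, not executed yet
  COMMIT;                       -- hypothetical: execute both stores in one plan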
As regards failure, we could start with our current model: one failure fails everything.

Utkarsh

> -----Original Message-----
> From: Olga Natkovich [mailto:[EMAIL PROTECTED]]
> Sent: Monday, May 19, 2008 11:23 AM
> To: [email protected]
> Subject: RE: How Grouping works for multiple groups
>
> Utkarsh,
>
> I agree that this issue has been brought up a number of times and needs
> to be addressed. It would be nice if we could address it without
> introducing new syntax for STORE. In batch mode, this would be quite
> easy, since we can build an execution plan for the entire script rather
> than one STORE at a time. I realize that the interactive and embedded
> cases are a bit trickier. We also need to clarify the semantics of this
> kind of operation in the presence of failure: if one store fails, what
> happens to the rest of the computation?
>
> Olga
>
> > -----Original Message-----
> > From: Utkarsh Srivastava [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, May 19, 2008 11:06 AM
> > To: [email protected]
> > Subject: FW: How Grouping works for multiple groups
> >
> > The following is an email that showed up on the user list; I am sure
> > most people have seen it.
> >
> > The writer wants to scan the data once and do multiple things with
> > it. This kind of need arises often, but we don't have a very good
> > answer to it.
> >
> > We have SPLIT, but that is only half the solution (and probably not
> > a very good one).
> >
> > What is needed is more like a multi-store command (I think someone
> > has proposed it on one of these lists before).
> >
> > So you would be able to do things like:
> >
> > A = LOAD ...
> > B = FILTER A BY ...
> > C = FILTER A BY ...
> > -- do something with B
> > -- do something else with C
> > STORE B, C;  <===== the new multi-store command
> >
> > Sawzall does better than us in this regard because it has collectors
> > to which you can output data, and you can set up as many collectors
> > as you want.
> >
> > Utkarsh
> >
> > -----Original Message-----
> > From: Goel, Ankur [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, May 19, 2008 1:24 AM
> > To: [EMAIL PROTECTED]
> > Cc: Holsman, Ian
> > Subject: How Grouping works for multiple groups
> >
> > Hi folks,
> > I am new to Pig, with a little Hadoop Map-Reduce experience. I
> > recently had the chance to use Pig for a data analysis task for
> > which I had earlier written a Map-Reduce program. A few questions
> > came up that I thought would be better asked in this forum. Here's a
> > brief description of my analysis task to give you an idea of what I
> > am doing:
> >
> > - For each tuple, I need to classify the data into 3 groups: A, B, C.
> >
> > - For groups A and B, I need to count the number of distinct items
> > in each group and have the counts sorted in reverse order in the
> > output.
> >
> > - For group C, I only need to output the distinct items.
> >
> > - The output for each of these goes to its respective output file,
> > e.g. A_file.txt, B_file.txt.
> >
> > Now, it seems that in Pig's execution plan each GROUP operation is a
> > separate Map-Reduce job, even though it happens on the same set of
> > tuples. In contrast, writing a Map-Reduce job by hand lets me prefix
> > a "group identifier" of my choice to the key and produce the
> > relevant value data, which I then use in the combiner and reducer to
> > perform the other operations and write to different files.
> >
> > If my understanding of Pig is correct, its execution plan spawns
> > multiple Map-Reduce jobs that each scan the same data set again for
> > a different group, which is costlier than writing a custom
> > Map-Reduce job that packs more work into a single job, as described
> > above.
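> > To make this concrete, my task written naively in Pig looks roughly
> > like the sketch below (field and file names are simplified stand-ins
> > for my real data); as far as I can tell, each GROUP/DISTINCT
> > pipeline here becomes its own Map-Reduce job re-scanning the same
> > input:
> >
> >   raw = LOAD 'events' AS (item, category);
> >   a = FILTER raw BY category == 'A';
> >   ga = GROUP a BY item;
> >   cnt_a = FOREACH ga GENERATE group AS item, COUNT(a) AS n;
> >   ord_a = ORDER cnt_a BY n DESC;
> >   STORE ord_a INTO 'A_file.txt';   -- one Map-Reduce pipeline
> >   -- ...the same pattern repeats for group B...
> >   c = FILTER raw BY category == 'C';
> >   dist_c = DISTINCT c;
> >   STORE dist_c INTO 'C_file.txt';  -- another job over the same data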
> >
> > I can always reduce the number of groups in my Pig script to 1 by
> > having a user-defined function generate the group prefixes before
> > the GROUP call, and then applying multiple filters on the group key,
> > again using a user-defined function that identifies the group. But
> > this is less than intuitive and requires more user-defined functions
> > than one would like.
> >
> > My question is: do the current optimization techniques take care of
> > such a scenario? My observation is that they don't, but I could be
> > wrong. If they do, how can I peek at the execution plan to make sure
> > it is not spawning more Map-Reduce jobs than necessary?
> >
> > If they don't, is this something planned for the future?
> >
> > Also, I don't see the Pig Pen debugging environment anywhere. Is it
> > still a part of Pig, and if so, how can I use it?
> >
> > I know it has been a rather long mail, but any help here is deeply
> > appreciated, as going forward we plan to use Pig heavily to avoid
> > writing custom Map-Reduce jobs for every different kind of analysis
> > we intend to do.
> >
> > Thanks and Regards
> > -Ankur
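> >
> > P.S. For concreteness, the single-group workaround I described might
> > be sketched as below; GroupOf is a hypothetical user-defined
> > function (not a Pig built-in) that returns 'A', 'B', or 'C' for a
> > tuple:
> >
> >   raw = LOAD 'events' AS (item);
> >   tagged = FOREACH raw GENERATE GroupOf(item) AS grp, item;
> >   g = GROUP tagged BY (grp, item);      -- one Map-Reduce job
> >   flat = FOREACH g GENERATE FLATTEN(group);
> >   a_items = FILTER flat BY grp == 'A';  -- then per-group filters
> >   b_items = FILTER flat BY grp == 'B';
> >   c_items = FILTER flat BY grp == 'C';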
