I agree with your sentiment, Thejas. Perhaps rather than calling it a new group type, we can introduce a keyword that can be used in multiple places? Something like:
c = group x by id using 'collected' __nosafety

(double underscores call attention to the fact that this is a keyword,
and a super-user feature at that)

Then we can use the same keyword to turn off checking for merge joins,
etc, on a per-call basis.

D

On Fri, Oct 7, 2011 at 4:29 PM, Thejas Nair <[email protected]> wrote:

> I would vote for option C - i would like the user to sign off in each
> place the feature is used.
>
> pig scripts will be modified over time, and person making the edit
> might not notice that the checks are turned off elsewhere in the
> script. If it is set in a properties file, it could get inadvertently
> used. I think dealing with incorrect results is too expensive, and
> justifies this.
>
> -Thejas
>
> On 10/7/11 8:23 AM, Alan Gates wrote:
>
>> I would vote for Dmitriy's original option b, on a per feature basis.
>> I know per feature switches are more cumbersome, but a "turn off all
>> sanity checks" option is dangerous. When removing safeties it seems
>> better to do it one at a time.
>>
>> Alan.
>>
>> On Oct 6, 2011, at 10:50 PM, Dmitriy Ryaboy wrote:
>>
>>> Little-known fact: MySQL actually has an --i-am-a-dummy parameter.
>>> Which is totally backwards, since if you are a dummy, the last thing
>>> you will do is use a little-known parameter to protect yourself...
>>> but I digress.
>>>
>>> Being able to set safety valves per-script seems like a good idea.
>>> Make it global, or per-feature? (pig.strict.collectedgroup,
>>> pig.strict.mergejoin, etc?)
>>>
>>> D
>>>
>>> On Thu, Oct 6, 2011 at 10:21 AM, Ashutosh Chauhan <[email protected]> wrote:
>>>
>>>> One possibility is to introduce 'mode' in Pig with default value of
>>>> 'strict'. Other values being 'non-strict' or potentially others.
>>>> Another use case for 'non-strict' mode is PigStorage usage in Merge
>>>> Join. Inherently PigStorage cannot guarantee all the requirements
>>>> imposed by Merge Join, but you can still use it in most cases.
>>>> I don't recall all the details but discussion can be found at:
>>>> https://issues.apache.org/jira/browse/PIG-1518
>>>>
>>>> Ashutosh
>>>>
>>>> On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <[email protected]> wrote:
>>>>
>>>>> Hi guys,
>>>>> It seems like our 'collected' option for group is pretty limited.
>>>>> Imagine I have the following (silly example) script:
>>>>>
>>>>> tweets = load 'tweets' using TweetLoader()
>>>>>     as (id:long, uid:long, text:chararray, ts:long);
>>>>> happy_words = load 'happy_words' using HappyLoader()
>>>>>     as (word:chararray);
>>>>>
>>>>> ngrams = foreach tweets generate id, uid, ts,
>>>>>     FLATTEN(NGRAM(text)) as (ngram:chararray);
>>>>>
>>>>> -- get only happy ngrams, using replicated to avoid MR step
>>>>> happy_ngrams = join ngrams by ngram, happy_words by word
>>>>>     using 'replicated';
>>>>>
>>>>> -- find only happy tweets. We know ngrams that were exploded from
>>>>> -- a single tweet must be in the same mapper still, so in theory
>>>>> -- this should work
>>>>> happy_tweets = group happy_ngrams by (id, uid) using 'collected';
>>>>>
>>>>> But this doesn't work, of course, because there's a whole mess of
>>>>> operators between the load and the group, including a join, and
>>>>> nothing makes any guarantees about (id, uid) being on the same
>>>>> mapper except for what the user knows about the data.
>>>>>
>>>>> What's the right approach to let the user force this through?
>>>>> a) this is an edge case optimization that's more trouble than it
>>>>>    is worth
>>>>> b) something like
>>>>>    "set pig.i.know.what.i.am.doing.collectedgroup=true" to disable
>>>>>    sanity checks
>>>>> c) using 'collected-its-cool-dmitriy-said-its-ok'
>>>>> d) drop the checks altogether
>>>>> e) something else?
>>>>>
>>>>> D
