Little-known fact: MySQL actually has an --i-am-a-dummy parameter. Which is totally backwards, since if you are a dummy, the last thing you will do is use a little-known parameter to protect yourself... but I digress.
Being able to set safety valves per-script seems like a good idea. Make it global, or per-feature (pig.strict.collectedgroup, pig.strict.mergejoin, etc.)?

D

On Thu, Oct 6, 2011 at 10:21 AM, Ashutosh Chauhan <[email protected]> wrote:

> One possibility is to introduce 'mode' in Pig with default value of
> 'strict'. Other values being 'non-strict' or potentially others. Another
> use case for 'non-strict' mode is PigStorage usage in Merge Join. Inherently
> PigStorage cannot guarantee all the requirements imposed by Merge Join, but
> you can still use it in most cases. I don't recall all the details but
> discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518
>
> Ashutosh
>
> On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <[email protected]> wrote:
>
> > Hi guys,
> > It seems like our 'collected' option for group is pretty limited.
> > Imagine I have the following (silly example) script:
> >
> > tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
> >     text:chararray, ts:long);
> > happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
> >
> > ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as
> >     (ngram:chararray);
> >
> > -- get only happy ngrams, using replicated to avoid MR step
> > happy_ngrams = join ngrams by ngram, happy_words by word using
> >     'replicated';
> >
> > -- find only happy tweets. We know ngrams that were exploded from a
> > -- single tweet must be in the same mapper still, so in theory this
> > -- should work
> > happy_tweets = group happy_ngrams by (id, uid) using 'collected';
> >
> > But this doesn't work, of course, because there's a whole mess of
> > operators between the load and the group, including a join, and nothing
> > makes any guarantees about (id, uid) being on the same mapper except for
> > what the user knows about the data.
> >
> > What's the right approach to let the user force this through?
> > a) this is an edge case optimization that's more trouble than it is worth
> > b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true" to
> >    disable sanity checks
> > c) using 'collected-its-cool-dmitriy-said-its-ok'
> > d) drop the checks altogether
> > e) something else?
> >
> > D
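
For illustration, a per-feature valve applied to the script above might look roughly like the sketch below. The property name pig.strict.collectedgroup is hypothetical: it is only the name floated in this thread, not a setting Pig is known to define, and nothing here is confirmed to change planner behavior. Only the `set key value;` statement form is existing Pig syntax; TweetLoader, HappyLoader and NGRAM are the placeholder UDFs from the example.

```pig
-- HYPOTHETICAL property, proposed in this thread; not an existing Pig setting.
-- Intent: relax only the sanity check for 'collected' group in this script.
set pig.strict.collectedgroup 'false';

tweets = load 'tweets' using TweetLoader() as (id:long, uid:long, text:chararray, ts:long);
happy_words = load 'happy_words' using HappyLoader() as (word:chararray);

ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as (ngram:chararray);

-- map-side join, so no extra MR step between the load and the group
happy_ngrams = join ngrams by ngram, happy_words by word using 'replicated';

-- the user, not the planner, vouches that all rows for one (id, uid)
-- are still in the same map task
happy_tweets = group happy_ngrams by (id, uid) using 'collected';
```

A single global 'mode' switch, as suggested in the quoted reply, would presumably relax every such check at once; the per-feature form would leave the others (for merge join, say) in place.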
