I would vote for Dmitriy's original option b, on a per-feature basis. I know per-feature switches are more cumbersome, but a "turn off all sanity checks" option is dangerous. When removing safeties it seems better to do it one at a time.
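As a rough sketch of what per-feature switches could look like in a script: the property names here are the hypothetical ones floated in the quoted thread, not existing Pig settings, and this assumes Pig's `set` statement would be the way to pass them.

```pig
-- Hypothetical per-feature switches; neither property exists in Pig today.
-- Relax only the sanity check for 'collected' group in this script.
set pig.strict.collectedgroup false;

-- The merge join check is not touched, so it would stay at its
-- (assumed) strict default. Uncommenting this would relax it as well.
-- set pig.strict.mergejoin false;

happy_tweets = group happy_ngrams by (id, uid) using 'collected';
```

Each safety is removed by name, so a script that disables one check would still get all the others.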
Alan.

On Oct 6, 2011, at 10:50 PM, Dmitriy Ryaboy wrote:

> Little-known fact: MySQL actually has an --i-am-a-dummy parameter. Which is
> totally backwards, since if you are a dummy, the last thing you will do is
> use a little-known parameter to protect yourself... but I digress.
>
> Being able to set safety valves per-script seems like a good idea. Make it
> global, or per-feature? (pig.strict.collectedgroup, pig.strict.mergejoin,
> etc?)
>
> D
>
> On Thu, Oct 6, 2011 at 10:21 AM, Ashutosh Chauhan <[email protected]> wrote:
>
>> One possibility is to introduce 'mode' in Pig with default value of
>> 'strict'. Other values being 'non-strict' or potentially others. Another
>> use case for 'non-strict' mode is PigStorage usage in Merge Join.
>> Inherently PigStorage cannot guarantee all the requirements imposed by
>> Merge Join, but you can still use it in most cases. I dont recall all the
>> details but discussion can be found at:
>> https://issues.apache.org/jira/browse/PIG-1518
>>
>> Ashutosh
>>
>> On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <[email protected]> wrote:
>>
>>> Hi guys,
>>> It seems like our 'collected' option for group is pretty limited.
>>> Imagine I have the following (silly example) script:
>>>
>>> tweets = load 'tweets' using TweetLoader() as (id:long, uid:long, text:chararray, ts:long);
>>> happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
>>>
>>> ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as (ngram:chararray);
>>>
>>> -- get only happy ngrams, using replicated to avoid MR step
>>> happy_ngrams = join ngrams by ngram, happy_words by word using 'replicated';
>>>
>>> -- find only happy tweets. We know ngrams that were exploded from a single tweet
>>> -- must be in the same mapper still, so in theory this should work
>>> happy_tweets = group happy_ngrams by (id, uid) using 'collected';
>>>
>>> But this doesn't work, of course, because there's a whole mess of operators
>>> between the load and the group, including a join, and nothing makes any
>>> guarantees about (id, uid) being on the same mapper except for what the
>>> user knows about the data.
>>>
>>> What's the right approach to let the user force this through?
>>> a) this is an edge case optimization that's more trouble than it is worth
>>> b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true to
>>> disable sanity checks
>>> c) using 'collected-its-cool-dmitriy-said-its-ok'
>>> d) drop the checks altogether
>>> e) something else?
>>>
>>> D
