I would vote for Dmitriy's original option b, on a per-feature basis. I know
per-feature switches are more cumbersome, but a "turn off all sanity checks"
option is dangerous. When removing safeties, it seems better to do it one at a
time.
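
For example, a single check could be relaxed in the script that needs it,
leaving every other check on. This is only a sketch: the property name is the
one Dmitriy floated below, not a setting Pig has today.

```pig
-- hypothetical per-feature switch; pig.strict.collectedgroup is not an existing Pig property
set pig.strict.collectedgroup 'false';

-- only the collected-group check would be skipped; merge join and the rest stay strict
happy_tweets = group happy_ngrams by (id, uid) using 'collected';
```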

Alan.

On Oct 6, 2011, at 10:50 PM, Dmitriy Ryaboy wrote:

> Little-known fact: MySQL actually has an --i-am-a-dummy parameter. Which is
> totally backwards, since if you are a dummy, the last thing you will do is
> use a little-known parameter to protect yourself... but I digress.
> 
> Being able to set safety valves per-script seems like a good idea. Make it
> global, or per-feature? (pig.strict.collectedgroup, pig.strict.mergejoin,
> etc?)
> 
> D
> 
> On Thu, Oct 6, 2011 at 10:21 AM, Ashutosh Chauhan <[email protected]> wrote:
> 
>> One possibility is to introduce a 'mode' in Pig with a default value of
>> 'strict', with other values being 'non-strict' or potentially others. Another
>> use case for 'non-strict' mode is PigStorage usage in Merge Join. Inherently,
>> PigStorage cannot guarantee all the requirements imposed by Merge Join, but
>> you can still use it in most cases. I don't recall all the details, but the
>> discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518
>> 
>> Ashutosh
>> On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <[email protected]> wrote:
>> 
>>> Hi guys,
>>> It seems like our 'collected' option for group is pretty limited.
>>> Imagine I have the following (silly example) script:
>>> 
>>> tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
>>> text:chararray, ts:long);
>>> happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
>>> 
>>> ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as
>>> (ngram:chararray);
>>> 
>>> -- get only happy ngrams, using replicated to avoid MR step
>>> happy_ngrams = join ngrams by ngram, happy_words by word using
>>> 'replicated';
>>> 
>>> -- find only happy tweets. We know ngrams that were exploded from a single
>>> -- tweet must be in the same mapper still, so in theory this should work
>>> happy_tweets = group happy_ngrams by (id, uid) using 'collected';
>>> 
>>> 
>>> But this doesn't work, of course, because there's a whole mess of operators
>>> between the load and the group, including a join, and nothing makes any
>>> guarantees about (id, uid) being on the same mapper except for what the user
>>> knows about the data.
>>> 
>>> What's the right approach to let the user force this through?
>>> a) this is an edge case optimization that's more trouble than it is worth
>>> b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true" to
>>> disable sanity checks
>>> c) using 'collected-its-cool-dmitriy-said-its-ok'
>>> d) drop the checks altogether
>>> e) something else?
>>> 
>>> D
>>> 
>> 
