I would vote for option C: I would like the user to sign off in each place the feature is used.

Pig scripts will be modified over time, and the person making an edit might not notice that the checks are turned off elsewhere in the script. If the switch is set in a properties file, it could get used inadvertently. I think dealing with incorrect results is too expensive, and that justifies requiring a sign-off at each use.

-Thejas


On 10/7/11 8:23 AM, Alan Gates wrote:
I would vote for Dmitriy's original option b, on a per-feature basis.  I know per-feature
switches are more cumbersome, but a "turn off all sanity checks" option is
dangerous.  When removing safeties it seems better to do it one at a time.

Alan.

On Oct 6, 2011, at 10:50 PM, Dmitriy Ryaboy wrote:

Little-known fact: MySQL actually has an --i-am-a-dummy parameter. Which is
totally backwards, since if you are a dummy, the last thing you will do is
use a little-known parameter to protect yourself... but I digress.

Being able to set safety valves per-script seems like a good idea. Make it
global, or per-feature? (pig.strict.collectedgroup, pig.strict.mergejoin,
etc?)
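
For illustration only, a per-feature switch might look roughly like this in a script. The property names are just the ones floated above; they are proposals in this thread, not settings I am claiming Pig has, and the true/false semantics shown are an assumption.

```pig
-- Hypothetical per-feature switch (a proposal from this thread, not a confirmed Pig setting).
-- Assumed semantics: true keeps the sanity check (assumed default), false disables it.
set pig.strict.collectedgroup false;

-- pig.strict.mergejoin is left untouched here, so under this sketch
-- the merge join checks would stay on for the rest of the script.
happy_tweets = group happy_ngrams by (id, uid) using 'collected';
```

A global switch would be a single property instead, at the cost of turning off every check at once.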

D

On Thu, Oct 6, 2011 at 10:21 AM, Ashutosh Chauhan <[email protected]> wrote:

One possibility is to introduce a 'mode' in Pig with a default value of
'strict', with other values being 'non-strict' or potentially others. Another
use case for 'non-strict' mode is using PigStorage in Merge Join. Inherently,
PigStorage cannot guarantee all the requirements imposed by Merge Join, but
you can still use it in most cases. I don't recall all the details, but the
discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518

Ashutosh
On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <[email protected]> wrote:

Hi guys,
It seems like our 'collected' option for group is pretty limited.
Imagine I have the following (silly example) script:

tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
text:chararray, ts:long);
happy_words = load 'happy_words' using HappyLoader() as (word:chararray);

ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as
(ngram:chararray);

-- get only happy ngrams, using replicated to avoid MR step
happy_ngrams = join ngrams by ngram, happy_words by word using
'replicated';

-- find only happy tweets. We know ngrams that were exploded from a single tweet
-- must be in the same mapper still, so in theory this should work
happy_tweets = group happy_ngrams by (id, uid) using 'collected';


But this doesn't work, of course, because there's a whole mess of operators
between the load and the group, including a join, and nothing makes any
guarantees about (id, uid) being on the same mapper except for what the user
knows about the data.

What's the right approach to let the user force this through?
a) this is an edge case optimization that's more trouble than it is worth
b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true" to
disable sanity checks
c) using 'collected-its-cool-dmitriy-said-its-ok'
d) drop the checks altogether
e) something else?

D



