I agree with your sentiment, Thejas. Perhaps rather than calling it a new
group type, we can introduce a keyword that can be used in multiple places?
Something like:

c = group x by id using 'collected' __nosafety

(double underscores call attention to the fact that this is a keyword, and a
super-user feature at that)

Then we can use the same keyword to turn off checking for merge joins, etc.,
on a per-call basis.
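
To make concrete what the check guards against, here is a rough sketch in Python rather than Pig. It simulates a map-side ("collected") group against a normal shuffled group. The function names, the toy rows, and the split layout are all made up for illustration. It also simplifies what Pig actually does in a mapper. It only shows the failure mode: if a key's rows end up in more than one split, each mapper emits its own partial group.

```python
from collections import defaultdict


def collected_group(splits, key):
    """Hypothetical model of a map-side group: every split is grouped
    on its own, with no shuffle, so groups are never merged across splits."""
    out = []
    for split in splits:
        groups = defaultdict(list)
        for row in split:
            groups[key(row)].append(row)
        out.extend(groups.items())
    return out


def shuffled_group(splits, key):
    """Model of a regular group: all rows for a key meet in one place."""
    groups = defaultdict(list)
    for split in splits:
        for row in split:
            groups[key(row)].append(row)
    return list(groups.items())


def by_tweet(row):
    # Rows are (id, uid, ngram); group on (id, uid) as in the script below.
    return (row[0], row[1])


# All rows of tweet 1 live in the same split: the collected group is safe.
co_located = [
    [(1, 10, "so happy"), (1, 10, "happy day")],
    [(2, 20, "feeling glad")],
]

# Rows of tweet 1 are spread over two splits: the collected group is wrong.
scattered = [
    [(1, 10, "so happy")],
    [(1, 10, "happy day"), (2, 20, "feeling glad")],
]

ok = collected_group(co_located, by_tweet)
bad = collected_group(scattered, by_tweet)
reference = shuffled_group(scattered, by_tweet)

print(len(ok), len(bad), len(reference))  # 2 3 2
```

With the scattered layout the key (1, 10) comes out twice, once per split. That is the silently incorrect result a per-call override would let through.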

D

On Fri, Oct 7, 2011 at 4:29 PM, Thejas Nair <[email protected]> wrote:

> I would vote for option C - I would like the user to sign off in each place
> the feature is used.
>
> Pig scripts will be modified over time, and the person making the edit might
> not notice that the checks are turned off elsewhere in the script. If it is
> set in a properties file, it could get inadvertently used. I think dealing
> with incorrect results is too expensive, and justifies this.
>
> -Thejas
>
>
>
> On 10/7/11 8:23 AM, Alan Gates wrote:
>
>> I would vote for Dmitriy's original option b, on a per-feature basis.  I
>> know per-feature switches are more cumbersome, but a "turn off all sanity
>> checks" option is dangerous.  When removing safeties it seems better to do
>> it one at a time.
>>
>> Alan.
>>
>> On Oct 6, 2011, at 10:50 PM, Dmitriy Ryaboy wrote:
>>
>>> Little-known fact: MySQL actually has an --i-am-a-dummy parameter. Which is
>>> totally backwards, since if you are a dummy, the last thing you will do is
>>> use a little-known parameter to protect yourself... but I digress.
>>>
>>> Being able to set safety valves per-script seems like a good idea. Make it
>>> global, or per-feature? (pig.strict.collectedgroup, pig.strict.mergejoin,
>>> etc?)
>>>
>>> D
>>>
>>> On Thu, Oct 6, 2011 at 10:21 AM, Ashutosh Chauhan <[email protected]>
>>> wrote:
>>>
>>>> One possibility is to introduce 'mode' in Pig with default value of
>>>> 'strict'. Other values being 'non-strict' or potentially others. Another
>>>> use case for 'non-strict' mode is PigStorage usage in Merge Join.
>>>> Inherently PigStorage cannot guarantee all the requirements imposed by
>>>> Merge Join, but you can still use it in most cases. I don't recall all the
>>>> details but discussion can be found at:
>>>> https://issues.apache.org/jira/browse/PIG-1518
>>>>
>>>> Ashutosh
>>>> On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy<[email protected]>
>>>>  wrote:
>>>>
>>>>> Hi guys,
>>>>> It seems like our 'collected' option for group is pretty limited.
>>>>> Imagine I have the following (silly example) script:
>>>>>
>>>>> tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
>>>>> text:chararray, ts:long);
>>>>> happy_words = load 'happy_words' using HappyLoader() as
>>>>> (word:chararray);
>>>>>
>>>>> ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as
>>>>> (ngram:chararray);
>>>>>
>>>>> -- get only happy ngrams, using replicated to avoid MR step
>>>>> happy_ngrams = join ngrams by ngram, happy_words by word using
>>>>> 'replicated';
>>>>>
>>>>> -- find only happy tweets. We know ngrams that were exploded from a
>>>>> -- single tweet must be in the same mapper still, so in theory this
>>>>> -- should work
>>>>> happy_tweets = group happy_ngrams by (id, uid) using 'collected';
>>>>>
>>>>>
>>>>> But this doesn't work, of course, because there's a whole mess of
>>>>> operators between the load and the group, including a join, and nothing
>>>>> makes any guarantees about (id, uid) being on the same mapper except for
>>>>> what the user knows about the data.
>>>>>
>>>>> What's the right approach to let the user force this through?
>>>>> a) this is an edge case optimization that's more trouble than it is
>>>>> worth
>>>>> b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true"
>>>>> to disable sanity checks
>>>>> c) using 'collected-its-cool-dmitriy-said-its-ok'
>>>>> d) drop the checks altogether
>>>>> e) something else?
>>>>>
>>>>> D
>>>>>
>>>>>
>>>>
>>
>
