Little-known fact: MySQL actually has an --i-am-a-dummy parameter. Which is
totally backwards, since if you are a dummy, the last thing you will do is
use a little-known parameter to protect yourself... but I digress.

Being able to set safety valves per-script seems like a good idea. Should the
switch be global, or per-feature (pig.strict.collectedgroup,
pig.strict.mergejoin, etc.)?
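
To make the per-feature variant concrete, here is a rough sketch of what the
top of a script might look like. The pig.strict.* property names are only the
hypothetical ones floated above; they are proposals in this thread, not
settings Pig is known to read. Only the 'set' statement itself is existing
Pig syntax.

    -- hypothetical per-feature safety valves (property names are proposals)
    set pig.strict.collectedgroup 'false';
    set pig.strict.mergejoin 'true';

    -- with the collected-group check relaxed, the user vouches for the data
    happy_tweets = group happy_ngrams by (id, uid) using 'collected';

A single global switch would collapse the two 'set' lines into one, at the
cost of relaxing every check at once.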

D

On Thu, Oct 6, 2011 at 10:21 AM, Ashutosh Chauhan <[email protected]> wrote:

> One possibility is to introduce a 'mode' in Pig with a default value of
> 'strict', other values being 'non-strict' or potentially others. Another use
> case for 'non-strict' mode is PigStorage usage in Merge Join. Inherently,
> PigStorage cannot guarantee all the requirements imposed by Merge Join, but
> you can still use it in most cases. I don't recall all the details, but the
> discussion can be found at: https://issues.apache.org/jira/browse/PIG-1518
>
> Ashutosh
> On Thu, Oct 6, 2011 at 08:50, Dmitriy Ryaboy <[email protected]> wrote:
>
> > Hi guys,
> > It seems like our 'collected' option for group is pretty limited.
> > Imagine I have the following (silly example) script:
> >
> > tweets = load 'tweets' using TweetLoader() as (id:long, uid:long,
> > text:chararray, ts:long);
> > happy_words = load 'happy_words' using HappyLoader() as (word:chararray);
> >
> > ngrams = foreach tweets generate id, uid, ts, FLATTEN(NGRAM(text)) as
> > (ngram:chararray);
> >
> > -- get only happy ngrams, using replicated to avoid MR step
> > happy_ngrams = join ngrams by ngram, happy_words by word using
> > 'replicated';
> >
> > -- find only happy tweets. We know ngrams that were exploded from a single
> > -- tweet must be in the same mapper still, so in theory this should work
> > happy_tweets = group happy_ngrams by (id, uid) using 'collected';
> >
> >
> > But this doesn't work, of course, because there's a whole mess of operators
> > between the load and the group, including a join, and nothing makes any
> > guarantees about (id, uid) being on the same mapper except for what the
> > user knows about the data.
> >
> > What's the right approach to let the user force this through?
> > a) this is an edge case optimization that's more trouble than it is worth
> > b) something like "set pig.i.know.what.i.am.doing.collectedgroup=true" to
> > disable sanity checks
> > c) using 'collected-its-cool-dmitriy-said-its-ok'
> > d) drop the checks altogether
> > e) something else?
> >
> > D
> >
>
