The JIRA https://issues.apache.org/jira/browse/PIG-514 has brought up an interesting issue of how we handle empty bags in foreach statements. The current Pig semantic for foreach is that it always produces a cross product of all of the fields in its projection list. So:

B = foreach A generate $0, $1;

technically produces a cross product of $0 and $1. Since both $0 and $1 are (generally) single valued, this produces one row. In cases where they are multi-valued (generate flatten($0), $1), the cross product produces multiple rows. In cases where any of the elements is an empty bag, the cross product produces no rows. That is, emptiness is equivalent to a 0 in multiplication: it swallows everything.
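
To make the multiplication analogy concrete, here is a minimal Python sketch. It is not Pig's actual implementation: it simply models each projected field as a list of values and takes their cross product with itertools.product. The function name foreach_row and the list-based model are assumptions made for illustration only.

```python
from itertools import product

def foreach_row(*fields):
    """Model one foreach output as the cross product of the projected fields.

    Hypothetical model: a single-valued field is a one-element list, and a
    flattened bag is a list of its values.
    """
    return list(product(*fields))

# Two single-valued fields produce exactly one row.
single = foreach_row(["a"], [1])

# A flattened bag of three values crossed with a single value gives three rows.
multi = foreach_row(["x", "y", "z"], [1])

# An empty bag anywhere in the projection yields no rows at all.
empty = foreach_row([], [1])
```

The row count is the product of the field sizes (1 x 1, 3 x 1, 0 x 1), which is why a single empty bag zeroes out the whole result.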

Because of this, Pig is currently implemented such that as soon as it sees an empty bag in an output it stops, because there's no point in continuing. So scripts like:

A = load 'myfile';
B = group A by $0;
C = foreach B {
    D = filter A by $1 > 5;
    E = filter D by $1 < 5;
    generate COUNT(E.$0), group;
}

will generate no output at all. It would be reasonable to expect that the above would produce a list of all entries from the first field of 'myfile', along with a 0 (for the count).

A couple of questions about this:

1) Should we keep this empty bag as a blackhole semantic? It strikes me as reasonable that instead of being a blackhole it instead produces a NULL value. This would make outer joins somewhat easier to do. I'm not sure what other side effects it would have.
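
As a rough illustration of the NULL alternative, the Python sketch below treats an empty bag as contributing a single null (None) to the cross product instead of zeroing it out. This is only a model of the proposed semantic, not a suggestion for how Pig would implement it; the name foreach_row_null is hypothetical.

```python
from itertools import product

def foreach_row_null(*fields):
    """Cross product in which an empty bag contributes a single None."""
    # Replace each empty field with [None] so it no longer zeroes the product.
    padded = [field if field else [None] for field in fields]
    return list(product(*padded))

# An empty bag next to a group key now yields one row with a null in it.
rows = foreach_row_null([], ["group_key"])
```

Under this model the row for the group key survives, which is the property that would help with outer joins.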

2) If we do keep the blackhole semantic, should UDFs get a chance to evaluate an empty bag? The current implementation certainly seems to violate the principle of least astonishment. However, if we extend this to UDFs we need to think carefully about what else it needs to be extended to. In particular, the semantic for streaming is that if we have no data, we will not invoke the external binary. It seems we should be consistent throughout. Any empty bag should either mean that we stop processing there and return nothing, or that we allow user-provided code a chance to run, even without input.

Thoughts?  Insights?

Alan.
