Please file a ticket and cc me. Thanks.
On Thu, Sep 17, 2015 at 11:20 PM, Mingyu Kim <m...@palantir.com> wrote:

> That sounds good. I think the optimizer should not change the behavior of
> execution, and reordering the filters can easily result in errors, as
> exemplified below. I agree that the optimizer should not reorder the
> filters, for correctness. Please correct me if I have an incorrect
> assumption about the guarantees of the optimizer.
>
> Is there a bug filed that tracks the change you suggested below, btw? I'd
> like to follow the issue, if there is one.
>
> Thanks,
> Mingyu
>
> From: Reynold Xin
> Date: Wednesday, September 16, 2015 at 1:17 PM
> To: Zack Sampson
> Cc: "dev@spark.apache.org", Mingyu Kim, Peter Faiman, Matt Cheah, Michael Armbrust
> Subject: Re: And.eval short circuiting
>
> This is "expected" in the sense that DataFrame operations can get
> reordered under the hood by the optimizer. For example, if the optimizer
> deems it cheaper to apply the second filter first, it might rearrange the
> filters. In reality, it doesn't do that. Still, I think this is too
> confusing and violates the principle of least astonishment, so we should
> fix it.
>
> I discussed this more with Michael offline, and we think we can add a rule
> for the physical filter operator to replace the general AND/OR/equality/etc.
> with a special version that treats null as false. This rule needs to be
> carefully written, because it should only apply to subtrees of
> AND/OR/equality/etc. (e.g. it shouldn't rewrite the children of isnull).
>
> On Tue, Sep 15, 2015 at 1:09 PM, Zack Sampson <zsamp...@palantir.com> wrote:
>
>> I see. We're having problems with code like this (forgive my noob scala):
>>
>> val df = Seq(("moose", "ice"), (null, "fire")).toDF("animals", "elements")
>> df
>>   .filter($"animals".rlike(".*"))
>>   .filter(callUDF({ (value: String) => value.length > 2 }, BooleanType, $"animals"))
>>   .collect()
>>
>> This code throws an NPE because:
>> * Catalyst combines the filters with an AND
>> * the first filter's predicate returns null for the row with the null value
>> * the second filter tries to read the length of that null
>>
>> This feels weird. Reading that code, I wouldn't expect null to be passed
>> to the second filter. Even weirder: if you call collect() after the first
>> filter you won't see nulls, and if you write the data to disk and reread
>> it, the NPE won't happen.
>>
>> It's bewildering! Is this the intended behavior?
>>
>> ------------------------------
>> From: Reynold Xin [r...@databricks.com]
>> Sent: Monday, September 14, 2015 10:14 PM
>> To: Zack Sampson
>> Cc: dev@spark.apache.org
>> Subject: Re: And.eval short circuiting
>>
>> rxin=# select null and true;
>>  ?column?
>> ----------
>>
>> (1 row)
>>
>> rxin=# select null and false;
>>  ?column?
>> ----------
>>  f
>> (1 row)
>>
>> null and false should return false.
>>
>> On Mon, Sep 14, 2015 at 9:12 PM, Zack Sampson <zsamp...@palantir.com> wrote:
>>
>>> It seems like And.eval can avoid calculating right.eval if left.eval
>>> returns null. Is there a reason it's written like it is?
>>>
>>> override def eval(input: Row): Any = {
>>>   val l = left.eval(input)
>>>   if (l == false) {
>>>     false
>>>   } else {
>>>     val r = right.eval(input)
>>>     if (r == false) {
>>>       false
>>>     } else {
>>>       if (l != null && r != null) {
>>>         true
>>>       } else {
>>>         null
>>>       }
>>>     }
>>>   }
>>> }
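A minimal sketch of the null-as-false rewrite Reynold describes above, assuming Catalyst-style expression classes (And, Or, IsNull, IsNotNull, Coalesce, Literal); it illustrates the shape of such a rule, not the change that actually landed in Spark:

  import org.apache.spark.sql.catalyst.expressions._

  // Coerce a null predicate result to false, descending only through AND/OR.
  // Children of IsNull/IsNotNull are left untouched, since coercing their
  // input would change what they test. A real rule would need similar care
  // around NOT: not(coalesce(x, false)) maps null to true instead of null.
  def nullAsFalse(e: Expression): Expression = e match {
    case And(left, right)         => And(nullAsFalse(left), nullAsFalse(right))
    case Or(left, right)          => Or(nullAsFalse(left), nullAsFalse(right))
    case _: IsNull | _: IsNotNull => e
    case other                    => Coalesce(Seq(other, Literal(false)))
  }

Applied to the predicate of a physical filter, this would let And short-circuit as soon as either side is null or false, because the filter discards the row in both cases.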
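Until a rule like that exists, one way to sidestep the NPE in Zack's example is to make the predicate itself null-safe, so the result is the same no matter how Catalyst combines or reorders the filters. A workaround sketch, reusing the callUDF form from the example above:

  import org.apache.spark.sql.functions.callUDF
  import org.apache.spark.sql.types.BooleanType
  // assumes a SQLContext in scope with sqlContext.implicits._ imported,
  // as in the original example, for toDF and $

  val df = Seq(("moose", "ice"), (null, "fire")).toDF("animals", "elements")
  df
    .filter($"animals".rlike(".*"))
    // guard inside the UDF, so null never reaches value.length even when
    // Catalyst merges both filters into a single AND
    .filter(callUDF({ (value: String) => value != null && value.length > 2 },
      BooleanType, $"animals"))
    .collect()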