Yep, I would expect pure to be both the majority and the default. That makes sense: these functions are not class members, so they have no (implicit) "this" pointer through which they could reference member variables whose state changes, which is what would lead to implementations with side effects (and impure results).
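
To make the trade-off concrete, here is a minimal sketch of the folding behavior Ted describes below. It is plain Java and not Drill code: the class PurityFoldingSketch, the Udf holder, its deterministic flag, and the project method are all hypothetical names made up for illustration. It assumes the planner evaluates a constant-argument call exactly once when the function is taken to be pure (the real trace shows an extra warm-up call).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch, not Drill's implementation: a toy projection that
// folds constant-argument calls when the function is assumed deterministic.
final class PurityFoldingSketch {

    /** A two-argument function plus a flag saying whether it is deterministic. */
    static final class Udf {
        final DoubleBinaryOperator body;
        final boolean deterministic;
        int calls = 0; // counts real invocations, like the "into eval" trace

        Udf(DoubleBinaryOperator body, boolean deterministic) {
            this.body = body;
            this.deterministic = deterministic;
        }

        double eval(double a, double b) {
            calls++;
            return body.applyAsDouble(a, b);
        }
    }

    /** Evaluates f(min, max) for rowCount rows; both arguments are constants. */
    static List<Double> project(Udf f, double min, double max, int rowCount) {
        List<Double> out = new ArrayList<>();
        if (f.deterministic) {
            // Assumed pure: evaluate once and reuse the value for every row.
            double folded = f.eval(min, max);
            for (int i = 0; i < rowCount; i++) {
                out.add(folded);
            }
        } else {
            // Declared nondeterministic: evaluate once per row.
            for (int i = 0; i < rowCount; i++) {
                out.add(f.eval(min, max));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        DoubleBinaryOperator uniform = (lo, hi) -> lo + (hi - lo) * rng.nextDouble();

        Udf assumedPure = new Udf(uniform, true);
        Udf declaredRandom = new Udf(uniform, false);

        System.out.println(project(assumedPure, 0, 10, 4) + " calls=" + assumedPure.calls);
        System.out.println(project(declaredRandom, 0, 10, 4) + " calls=" + declaredRandom.calls);
    }
}
```

With the flag left at the pure default, a random function silently returns one repeated value, which is the wrong-results case Daniel raises; with the flag set the other way, the only cost is the extra calls.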

On Tue, Jul 21, 2015 at 5:25 PM, Ted Dunning <[email protected]> wrote:

> Even in my own warped experience, the vast majority of UDF's I have written
> or considered writing have been pure.
>
> On Tue, Jul 21, 2015 at 4:27 PM, Jacques Nadeau <[email protected]> wrote:
>
> > I don't think so. There are something like 1500 functions where this isn't
> > true (default) and one or two where it is.
> >
> > On Tue, Jul 21, 2015 at 4:25 PM, Daniel Barclay <[email protected]> wrote:
> >
> > > Should Drill be defaulting the other way?
> > >
> > > That is, instead of assuming pure unless declared otherwise (leading to
> > > wrong results in the case that the assumption is wrong (or the annotation
> > > was forgotten)), should Drill be assuming not pure unless declared pure
> > > (leading to only lower performance in the wrong-assumption case)?
> > >
> > > Daniel
> > >
> > > Jacques Nadeau wrote:
> > >
> > >> There is an annotation on the function template. I don't have a laptop
> > >> close but I believe it is something similar to isRandom. It basically
> > >> tells Drill that this is a nondeterministic function. I will be more
> > >> specific once I get back to my machine if you don't find it sooner.
> > >>
> > >> Jacques
> > >>
> > >> *Summary:*
> > >>
> > >> Drill is very aggressive about optimizing away calls to functions with
> > >> constant arguments. I worry that could extend to per record batch
> > >> optimization if I accidentally have constant values and even if that
> > >> doesn't happen, it is a pain in the ass now largely because Drill is
> > >> clever enough to see through my attempt to hide the constant nature of
> > >> my parameters.
> > >>
> > >> *Question:*
> > >>
> > >> Is there a way to mark a UDF as not being a pure function?
> > >>
> > >> *Details:*
> > >>
> > >> I have written a UDF to generate a random number. It takes parameters
> > >> that define the distribution. All seems well and good.
> > >>
> > >> I find, however, that the function is only called once (twice, actually
> > >> apparently due to pipeline warmup) and then Drill optimizes away later
> > >> calls, apparently because the parameters to the function are constant
> > >> and Drill thinks my function is a pure function. If I make up some bogus
> > >> data to pass in as a parameter, all is well and the function is called
> > >> as much as I wanted.
> > >>
> > >> For instance, with the uniform distribution, my function takes two
> > >> arguments, those being the minimum and maximum value to return. Here is
> > >> what I see with constants for the min and max:
> > >>
> > >> 0: jdbc:drill:zk=local> select random(0,10) from (values 5,5,5,5) as tbl(x);
> > >> into eval
> > >> into eval
> > >> +---------------------+
> > >> |       EXPR$0        |
> > >> +---------------------+
> > >> | 1.7787372583008298  |
> > >> | 1.7787372583008298  |
> > >> | 1.7787372583008298  |
> > >> | 1.7787372583008298  |
> > >> +---------------------+
> > >>
> > >> If I include an actual value, we see more interesting behavior even if
> > >> the value is effectively constant:
> > >>
> > >> 0: jdbc:drill:zk=local> select random(0,x) from (values 5,5,5,5) as tbl(x);
> > >> into eval
> > >> into eval
> > >> into eval
> > >> into eval
> > >> +----------------------+
> > >> |        EXPR$0        |
> > >> +----------------------+
> > >> | 3.688377805419459    |
> > >> | 0.2827056410711032   |
> > >> | 2.3107479622644918   |
> > >> | 0.10813788169218574  |
> > >> +----------------------+
> > >> 4 rows selected (0.088 seconds)
> > >>
> > >> Even if I make the max value come along from the sub-query, I get the
> > >> evil behavior although the function is now surprisingly actually called
> > >> three times, apparently to do with warming up the pipeline:
> > >>
> > >> 0: jdbc:drill:zk=local> select random(0,max_value) from (select 14 as max_value,x from (values 5,5,5,5) as tbl(x)) foo;
> > >> into eval
> > >> into eval
> > >> into eval
> > >> +---------------------+
> > >> |       EXPR$0        |
> > >> +---------------------+
> > >> | 13.404462063773702  |
> > >> | 13.404462063773702  |
> > >> | 13.404462063773702  |
> > >> | 13.404462063773702  |
> > >> +---------------------+
> > >> 4 rows selected (0.121 seconds)
> > >>
> > >> The UDF itself is boring and can be found at
> > >> https://gist.github.com/tdunning/0c2cc2089e6cd8c030c0
> > >>
> > >> So how can I defeat this behavior?
> > >
> > > --
> > > Daniel Barclay
> > > MapR Technologies
