I don't think so. Something like 1500 functions are deterministic (the default), and only one or two are not.
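
To make the trade-off concrete, here is a toy model of the two defaults. It is not Drill's planner code. The `Udf` annotation, its `isRandom` attribute, and the `evaluate` helper are all made up for illustration; the attribute name just borrows from what Jacques recalls below. A function without the flag is folded into a single call when its arguments are constant, while a flagged function is called once per row.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.Arrays;
import java.util.Random;

public class PurityDemo {

    // Hypothetical annotation for this sketch only; it is not Drill's FunctionTemplate.
    @Retention(RetentionPolicy.RUNTIME)
    public @interface Udf {
        boolean isRandom() default false; // default: assume the function is pure
    }

    // Base class that counts how often the function body actually runs.
    public abstract static class CountingFunction {
        public int calls = 0;

        public final double call(double min, double max) {
            calls++;
            return eval(min, max);
        }

        protected abstract double eval(double min, double max);
    }

    // Uniform random value in [min, max); shared by both variants below.
    public abstract static class UniformRandom extends CountingFunction {
        private final Random rng = new Random();

        @Override
        protected double eval(double min, double max) {
            return min + (max - min) * rng.nextDouble();
        }
    }

    @Udf // flag forgotten, so the toy planner treats it as pure
    public static class UnmarkedRandom extends UniformRandom { }

    @Udf(isRandom = true) // declared nondeterministic
    public static class MarkedRandom extends UniformRandom { }

    // Evaluate f(min, max) for every row, folding the call when f is assumed pure.
    public static double[] evaluate(CountingFunction f, double min, double max, int rows) {
        Udf meta = f.getClass().getAnnotation(Udf.class);
        boolean assumedPure = meta == null || !meta.isRandom();
        double[] out = new double[rows];
        if (assumedPure) {
            Arrays.fill(out, f.call(min, max)); // constant arguments: one call, reused
        } else {
            for (int i = 0; i < rows; i++) {
                out[i] = f.call(min, max);      // one call per row
            }
        }
        return out;
    }

    public static void main(String[] args) {
        CountingFunction unmarked = new UnmarkedRandom();
        CountingFunction marked = new MarkedRandom();
        System.out.println(Arrays.toString(evaluate(unmarked, 0, 10, 4)) + " calls=" + unmarked.calls);
        System.out.println(Arrays.toString(evaluate(marked, 0, 10, 4)) + " calls=" + marked.calls);
    }
}
```

Running `main` should report one call for the unmarked function (the same value repeated four times) and four calls for the marked one. The real queries quoted below show two or three calls because of pipeline warmup, which this sketch does not model.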

On Tue, Jul 21, 2015 at 4:25 PM, Daniel Barclay <[email protected]> wrote:

> Should Drill be defaulting the other way?
>
> That is, instead of assuming pure unless declared otherwise (leading to
> wrong results in the case that the assumption is wrong (or the annotation
> was forgotten)), should Drill be assuming not pure unless declared pure
> (leading to only lower performance in the wrong-assumption case)?
>
> Daniel
>
> Jacques Nadeau wrote:
>
>> There is an annotation on the function template. I don't have a laptop
>> close but I believe it is something similar to isRandom. It basically
>> tells Drill that this is a nondeterministic function. I will be more
>> specific once I get back to my machine if you don't find it sooner.
>>
>> Jacques
>>
>> *Summary:*
>>
>> Drill is very aggressive about optimizing away calls to functions with
>> constant arguments. I worry that could extend to per record batch
>> optimization if I accidentally have constant values and even if that
>> doesn't happen, it is a pain in the ass now largely because Drill is
>> clever enough to see through my attempt to hide the constant nature of
>> my parameters.
>>
>> *Question:*
>>
>> Is there a way to mark a UDF as not being a pure function?
>>
>> *Details:*
>>
>> I have written a UDF to generate a random number. It takes parameters
>> that define the distribution. All seems well and good.
>>
>> I find, however, that the function is only called once (twice, actually
>> apparently due to pipeline warmup) and then Drill optimizes away later
>> calls, apparently because the parameters to the function are constant and
>> Drill thinks my function is a pure function. If I make up some bogus data
>> to pass in as a parameter, all is well and the function is called as much
>> as I wanted.
>>
>> For instance, with the uniform distribution, my function takes two
>> arguments, those being the minimum and maximum value to return. Here is
>> what I see with constants for the min and max:
>>
>> 0: jdbc:drill:zk=local> select random(0,10) from (values 5,5,5,5) as tbl(x);
>> into eval
>> into eval
>> +---------------------+
>> |       EXPR$0        |
>> +---------------------+
>> | 1.7787372583008298  |
>> | 1.7787372583008298  |
>> | 1.7787372583008298  |
>> | 1.7787372583008298  |
>> +---------------------+
>>
>> If I include an actual value, we see more interesting behavior even if the
>> value is effectively constant:
>>
>> 0: jdbc:drill:zk=local> select random(0,x) from (values 5,5,5,5) as tbl(x);
>> into eval
>> into eval
>> into eval
>> into eval
>> +----------------------+
>> |        EXPR$0        |
>> +----------------------+
>> | 3.688377805419459    |
>> | 0.2827056410711032   |
>> | 2.3107479622644918   |
>> | 0.10813788169218574  |
>> +----------------------+
>> 4 rows selected (0.088 seconds)
>>
>> Even if I make the max value come along from the sub-query, I get the evil
>> behavior although the function is now surprisingly actually called three
>> times, apparently to do with warming up the pipeline:
>>
>> 0: jdbc:drill:zk=local> select random(0,max_value) from (select 14 as max_value,x from (values 5,5,5,5) as tbl(x)) foo;
>> into eval
>> into eval
>> into eval
>> +---------------------+
>> |       EXPR$0        |
>> +---------------------+
>> | 13.404462063773702  |
>> | 13.404462063773702  |
>> | 13.404462063773702  |
>> | 13.404462063773702  |
>> +---------------------+
>> 4 rows selected (0.121 seconds)
>>
>> The UDF itself is boring and can be found at
>> https://gist.github.com/tdunning/0c2cc2089e6cd8c030c0
>>
>> So how can I defeat this behavior?
>
> --
> Daniel Barclay
> MapR Technologies
