Re: question about UDF optimization

Jacques Nadeau Tue, 21 Jul 2015 16:28:08 -0700

I don't think so.  There are something like 1500 functions that are pure
(the default) and only one or two that aren't.
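
To make the trade-off concrete, here is a minimal sketch of constant folding
with a pure-by-default assumption and an opt-out flag. It is not Drill's code:
the `Fn` class, its `nonDeterministic` field, and the `project` method are
hypothetical names made up for illustration. They stand in for the annotation
attribute discussed in the quoted message below.

```java
import java.util.Arrays;
import java.util.function.DoubleSupplier;

public class ConstantFoldingSketch {

    /** Hypothetical stand-in for a registered function called with constant arguments. */
    static final class Fn {
        final DoubleSupplier body;
        // Hypothetical flag; false means "assumed pure", which is the default discussed here.
        final boolean nonDeterministic;
        int calls = 0;

        Fn(DoubleSupplier body, boolean nonDeterministic) {
            this.body = body;
            this.nonDeterministic = nonDeterministic;
        }

        double eval() {
            calls++;
            return body.getAsDouble();
        }
    }

    /** Produces one output value per row, folding the call when the function is assumed pure. */
    static double[] project(Fn fn, int rows) {
        double[] out = new double[rows];
        if (!fn.nonDeterministic) {
            // Assumed pure: evaluate once and reuse the value for every row.
            Arrays.fill(out, fn.eval());
        } else {
            // Flagged as nondeterministic: evaluate once per row.
            for (int i = 0; i < rows; i++) {
                out[i] = fn.eval();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] counter = {0};
        Fn assumedPure = new Fn(() -> counter[0]++, false);
        Fn flagged = new Fn(() -> counter[0]++, true);

        System.out.println(Arrays.toString(project(assumedPure, 4)) + " calls=" + assumedPure.calls);
        System.out.println(Arrays.toString(project(flagged, 4)) + " calls=" + flagged.calls);
    }
}
```

With the flag left at its default, the function is evaluated once and its
value is repeated for every row, which is wrong if the function is not really
pure. With the flag set, it is evaluated once per row, which only costs time.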


On Tue, Jul 21, 2015 at 4:25 PM, Daniel Barclay <[email protected]>
wrote:

>
> Should Drill be defaulting the other way?
>
> That is, instead of assuming a function is pure unless declared otherwise
> (which gives wrong results when the assumption is wrong or the annotation
> was forgotten), should Drill assume a function is not pure unless declared
> pure (which only costs performance when the assumption is wrong)?
>
> Daniel
>
>
>
>
> Jacques Nadeau wrote:
>
>> There is an annotation on the function template.  I don't have a laptop
>> close, but I believe it is something similar to isRandom. It basically
>> tells Drill that this is a nondeterministic function. I will be more
>> specific once I get back to my machine if you don't find it sooner.
>>
>> Jacques
>> *Summary:*
>>
>> Drill is very aggressive about optimizing away calls to functions with
>> constant arguments. I worry that this could extend to per-record-batch
>> optimization if I accidentally have constant values. Even if that
>> doesn't happen, it is a pain in the ass now, largely because Drill is
>> clever enough to see through my attempts to hide the constant nature of
>> my parameters.
>>
>> *Question:*
>>
>> Is there a way to mark a UDF as not being a pure function?
>>
>> *Details:*
>>
>> I have written a UDF to generate a random number.  It takes parameters
>> that define the distribution.  All seems well and good.
>>
>> I find, however, that the function is only called once (actually twice,
>> apparently due to pipeline warmup) and then Drill optimizes away later
>> calls, apparently because the parameters to the function are constant and
>> Drill thinks my function is a pure function.  If I make up some bogus data
>> to pass in as a parameter, all is well and the function is called as many
>> times as I want.
>>
>> For instance, with the uniform distribution, my function takes two
>> arguments, those being the minimum and maximum value to return.  Here is
>> what I see with constants for the min and max:
>>
>> 0: jdbc:drill:zk=local> select random(0,10) from (values 5,5,5,5) as tbl(x);
>> into eval
>> into eval
>> +---------------------+
>> |       EXPR$0        |
>> +---------------------+
>> | 1.7787372583008298  |
>> | 1.7787372583008298  |
>> | 1.7787372583008298  |
>> | 1.7787372583008298  |
>> +---------------------+
>>
>>
>> If I include an actual value, we see more interesting behavior even if the
>> value is effectively constant:
>>
>> 0: jdbc:drill:zk=local> select random(0,x) from (values 5,5,5,5) as tbl(x);
>> into eval
>> into eval
>> into eval
>> into eval
>> +----------------------+
>> |        EXPR$0        |
>> +----------------------+
>> | 3.688377805419459    |
>> | 0.2827056410711032   |
>> | 2.3107479622644918   |
>> | 0.10813788169218574  |
>> +----------------------+
>> 4 rows selected (0.088 seconds)
>>
>>
>> Even if I make the max value come from the sub-query, I still get the evil
>> behavior, although the function is now, surprisingly, called three times,
>> apparently something to do with warming up the pipeline:
>>
>> 0: jdbc:drill:zk=local> select random(0,max_value) from (select 14 as max_value,x from (values 5,5,5,5) as tbl(x)) foo;
>> into eval
>> into eval
>> into eval
>> +---------------------+
>> |       EXPR$0        |
>> +---------------------+
>> | 13.404462063773702  |
>> | 13.404462063773702  |
>> | 13.404462063773702  |
>> | 13.404462063773702  |
>> +---------------------+
>> 4 rows selected (0.121 seconds)
>>
>> The UDF itself is boring and can be found at
>> https://gist.github.com/tdunning/0c2cc2089e6cd8c030c0
>>
>> So how can I defeat this behavior?
>>
>>
>
> --
> Daniel Barclay
> MapR Technologies
>
