Hi Paul, yes, the any_value function will need to have separate generated code for different data types and mode combinations. I believe Gautam has implemented all the scalar types (through the standard template mechanism that Drill follows). The complex types are the ones that are harder.
> Moreover, the code generator must understand that code generated for a Map UDAF must be different than that for a scalar UDAF. Presumably we must have that code, since the UDF mechanism supports maps. Yes, I assume you are referring to the decision point here [1]. There is some overlap with what we do with MAPPIFY/KVGEN function which occurs as part of the Project operator. ProjectRecordBatch generates code for the functions that require ComplexWriter. The MAPPIFY function reads data using a FieldReader [2] and outputs data using a ComplexWriter. However, there are differences with how ANY_VALUE operates particularly because we want to treat it as an Aggregate function. For instance, a ValueReference in a ComplexWriter is always marked as LATE binding type [3] whereas for ANY_VALUE we want it to reflect the input type. Code generation for either StreamingAgg or HashAgg does not like LATE type. So, this is a new requirement which potentially needs some changes to ValueReference. Regarding repeated maps/arrays, let me discuss with Gautam about the details and will provide an update. For Hash Agg versus Streaming Agg, I have some thoughts that I will send out in a follow-up email. [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java#L108 [2] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java#L55 [3] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/ValueReference.java#L76 -Aman On Fri, Apr 13, 2018 at 10:31 AM, Paul Rogers <par0...@yahoo.com.invalid> wrote: > Hi Aman & Gautam, > > FWIW, here is my understanding of UDAF functions based on a write-up I did > a few months back. [1] > > All Drill functions are implemented using the UDF and UDAF protocol. (The > "U" (User) is a bit of a misnomer, internal and user-defined functions > follow the same protocol.) Every UDF (including UDAF) is strongly typed in > its arguments and return value. To create the ANY_VALUE implementation, you > must create a separate implementation for each and every combination of > (type, mode). That is, we need a REQUIRED INT and OPTIONAL INT > implementation for INT types. > > In this case, the incoming argument is passed using the argument > annotation, and the return value via the return annotation. The generated > code sets the incoming argument and copies the return value from the return > variable. (There is an example of the generated code in the write-up.) > > For a Map, there is no annotation to say "set this value to a map" for > either input or output. Instead, we pass in a complex reader for input (I > believe) and a complex writer for output. (Here I am a bit hazy as I never > had time to experiment with maps and arrays in UDFs.) > > So, you'll need a Map implementation. (Maps can only be REQUIRED, never > OPTIONAL, unless they are in a UNION or LIST...) > > Moreover, the code generator must understand that code generated for a Map > UDAF must be different than that for a scalar UDAF. Presumably we must have > that code, since the UDF mechanism supports maps. > > Have you worked out how to handle arrays (REPEATED cardinality?) It was > not clear from my exploration of UDFs how we handle REPEATED types. The > UDAF must pass in one array, which the UDAF copies to its output, which is > then written to the output repeated vector. Since values must arrive in > Holders, it is not clear how this would be done for arrays. Perhaps there > is an annotation that lets us use some form of complex writer for arrays as > is done for MAPs? Again, sorry, I didn't have time to learn that bit. Would > be great to understand that so we can add it to the write-up. > > This chain mentions a MAP type. Drill also includes other complex types: > REPEATED MAP, (non-repeated) LIST, (repeated) LIST, and UNION. It is not at > all clear how UDAFs work for these types. > > One other thing to consider: ANY_VALUE can never work for the hash agg > because output values are updated in random order. It can only ever work > for streaming agg because the streaming agg only appends output values. > Fortunately, this chain is about the streaming agg. For Hash Agg, > intermediate variable-width values are stored in an Object Vector, but > those values won't survive serialization. As a result, only fixed-width > types can be updated in random order. DRILL-6087 describes this issue. > > Thanks, > - Paul > > [1] https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information > > > > > > > On Wednesday, April 11, 2018, 4:09:47 PM PDT, Aman Sinha < > amansi...@apache.org> wrote: > > Here's some background on what Gautam is trying to do: Currently, SQL > does > not have a standard way to do a DISTINCT on a subset of the columns in the > SELECT list. Suppose there are 2 columns: > a: INTEGER > b: MAP > Suppose I want to only do DISTINCT on 'a' and I don't really care about the > column 'b' .. I just want the first or any value of 'b' within a single > group of 'a'. Postgres actually has a 'DISTINCT ON(a), b' syntax but > based on our discussion on the Calcite mailing list, we want to avoid that > syntax. So, there's an alternative proposal to do the following: > > SELECT a, ANY_VALUE(b) FROM table GROUP BY a > > This means, ANY_VALUE will essentially be treated as an Aggregate function > and from a code-gen perspective, we want to read 1 item (a MapHolder) from > the incoming MapVector and write it to a particular index in the output > MapVector. This is where it would be useful to have > MapVector.setSafe() since the StreamingAgg and HashAgg both generate > setSafe() for normal aggregate functions. > > However, it seems the better (or perhaps only) way to do this is through > the MapOrListWriter (ComplexWriter) as long as there's a way to instruct > the writer to write to a specific output index (the output index is needed > because there are several groups in the output container and we want to > write to a specific one). > > -Aman > > > On Wed, Apr 11, 2018 at 2:13 PM, Paul Rogers <par0...@yahoo.com.invalid> > wrote: > > > What semantics are wanted? SetSafe sets a single value in a vector. What > > does it mean to set a single map or array value? What would we pass as an > > argument? > > For non-simple types, something needs to iterate over the values: be they > > elements of a map, elements in an array, elements of an array of maps, > then > > over the map members, etc. > > I believe that you are hitting a fundamental difference between simple > > scale values and complex (composite) values. > > This is for an aggregate. There is no meaningful aggregate of a map or an > > array. Once could aggregate over a scalar that is a member of a map to > > produce a scalar result. Or, one could iterate over the members of an > array > > to produce, say, an average or sum. > > You are dealing with aggregate UDFs (even built in functions implement > the > > UDAF protocol.) A quick check of the source code does not find a > > "AnyValueComplexFunctions" class, so this may perhaps be something new > you > > are developing. What are the desired semantics? > > The UDAF protocol can include a complex writer for maps. I've not played > > with that yet. But, it does not seem meaningful to aggregate a map to > > produce a map or to aggregate an array to produce an array. The idea is > > that the UDAF figures out what to do with maps, then uses the complex > > writer to produce the desired result. This makes sense since there is no > > way to store a map as a simple value passed to setSafe(). > > Can you provide additional details of what you are trying to do? > > Thanks, > > - Paul > > > > > > > > On Wednesday, April 11, 2018, 1:53:12 PM PDT, Padma Penumarthy < > > ppenumar...@mapr.com> wrote: > > > > I guess you can add a setSafe method which recursively does setSafe for > > all children. > > > > Thanks > > Padma > > > > > > > On Apr 11, 2018, at 1:19 PM, Gautam Parai <gpa...@mapr.com> wrote: > > > > > > Hi Paul/Padma, > > > > > > > > > Thank you so much for the responses. This function is supposed to > return > > `any value` from the batch of incoming rows. Hence, the need to handle > > maps/lists. > > > > > > > > > This codegen is for the StreamingAggregator for Complex type(e.g. maps) > > in the incoming batch. It is trying to assign the values in the > > ComplexHolder to the outgoing MapVector. > > > > > > > > > MapVector vv9; // Output VV of StreamingAgg > > > > > > .... > > > > > > > > > public void outputRecordValues(int outIndex) > > > > > > throws SchemaChangeException > > > > > > { > > > > > > { > > > > > > ComplexHolder out8; > > > > > > { > > > > > > final ComplexHolder out = new ComplexHolder(); > > > > > > FieldReader fr = work0; > > > > > > MapHolder value = work1; > > > > > > BigIntHolder nonNullCount = work2; > > > > > > > > > > > > AnyValueComplexFunctions$MapAnyValue_output: { > > > > > > out.reader = fr; > > > > > > } > > > > > > > > > > > > work0 = fr; > > > > > > work1 = value; > > > > > > work2 = nonNullCount; > > > > > > out8 = out; > > > > > > } > > > > > > vv9 .getMutator().setSafe((outIndex), out8); //Don't have > > setSafe for MapVector > > > > > > } > > > > > > } > > > > > > > > > Please let me know your thoughts. > > > > > > > > > Gautam > > > > > > > > > > > > ________________________________ > > > From: Paul Rogers <par0...@yahoo.com.INVALID> > > > Sent: Wednesday, April 11, 2018 12:40:15 PM > > > To: dev@drill.apache.org > > > Subject: Re: [DISCUSS] Regarding mutator interface > > > > > > Note that, for maps and lists, there is nothing to set. Maps are purely > > containers for other vectors. Lists (you didn't mention whether > "repeated" > > or "non-repeated") are also containers. Non-repeated lists are containers > > for unions, repeated-lists are containers for arrays. > > > Any setting should be done on the contained vectors. For lists, only > the > > offset vector is updated. > > > So, another question is: what is the generated code trying to set? > > > > > > Thanks, > > > - Paul > > > > > > > > > > > > On Wednesday, April 11, 2018, 12:33:52 PM PDT, Padma Penumarthy < > > ppenumar...@mapr.com> wrote: > > > > > > Can you explain how aggregation on complex type works (or supposed to > > work). > > > > > > Thanks > > > Padma > > > > > > > > >> On Apr 11, 2018, at 12:15 PM, Gautam Parai <gpa...@mapr.com> wrote: > > >> > > >> Hi all, > > >> > > >> > > >> I am implementing a new aggregate function which also handles Complex > > types (map and list). However, the codegen barfs with > > >> > > >> > > >> CompileException: Line 104, Column 39: A method named "setSafe" is not > > declared in any enclosing class nor any supertype, nor through a static > > import > > >> > > >> > > >> It looks like we do not have set()/ setSafe() methods for > > MapVector/ListVector mutators. > > >> > > >> > > >> Should we add these methods to the Mutator interface to ensure all > > mutators implement them? Is these a reason we chose not to do so? > > >> > > >> > > >> Please let me know your thoughts. Thanks! > > >> > > >> > > >> Gautam > > > > > > > > >