Hi Paul,
yes, the any_value function will need to have separate generated code for
different data types and mode combinations.  I believe Gautam has
implemented all the scalar types (through the standard template mechanism
that Drill follows).  The complex types are the ones that are harder.

> Moreover, the code generator must understand that code generated for a
Map UDAF must be different than that for a scalar UDAF. Presumably we must
have that code, since the UDF mechanism supports maps.

Yes, I assume you are referring to the decision point here [1].

There is some overlap with what we do with MAPPIFY/KVGEN function which
occurs as part of the Project operator.  ProjectRecordBatch generates code
for the functions that require ComplexWriter.    The MAPPIFY function reads
data using a FieldReader [2]  and outputs data using a ComplexWriter.
 However, there are differences with how ANY_VALUE operates particularly
because we want to treat it as an Aggregate function.  For instance, a
ValueReference in a ComplexWriter is always marked as LATE binding type [3]
whereas for ANY_VALUE we want it to reflect the input type.  Code
generation for either StreamingAgg or HashAgg does not like LATE type.  So,
this is a new requirement which potentially needs some changes to
ValueReference.

Regarding repeated maps/arrays, let me discuss with Gautam about the
details and will provide an update.

For Hash Agg versus Streaming Agg, I have some thoughts that I will send
out in a follow-up email.

[1]
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java#L108
[2]
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/Mappify.java#L55
[3]
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/ValueReference.java#L76

-Aman

On Fri, Apr 13, 2018 at 10:31 AM, Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi Aman & Gautam,
>
> FWIW, here is my understanding of UDAF functions based on a write-up I did
> a few months back. [1]
>
> All Drill functions are implemented using the UDF and UDAF protocol. (The
> "U" (User) is a bit of a misnomer, internal and user-defined functions
> follow the same protocol.) Every UDF (including UDAF) is strongly typed in
> its arguments and return value. To create the ANY_VALUE implementation, you
> must create a separate implementation for each and every combination of
> (type, mode). That is, we need a REQUIRED INT and OPTIONAL INT
> implementation for INT types.
>
> In this case, the incoming argument is passed using the argument
> annotation, and the return value via the return annotation. The generated
> code sets the incoming argument and copies the return value from the return
> variable. (There is an example of the generated code in the write-up.)
>
> For a Map, there is no annotation to say "set this value to a map" for
> either input or output. Instead, we pass in a complex reader for input (I
> believe) and a complex writer for output. (Here I am a bit hazy as I never
> had time to experiment with maps and arrays in UDFs.)
>
> So, you'll need a Map implementation. (Maps can only be REQUIRED, never
> OPTIONAL, unless they are in a UNION or LIST...)
>
> Moreover, the code generator must understand that code generated for a Map
> UDAF must be different than that for a scalar UDAF. Presumably we must have
> that code, since the UDF mechanism supports maps.
>
> Have you worked out how to handle arrays (REPEATED cardinality?) It was
> not clear from my exploration of UDFs how we handle REPEATED types. The
> UDAF must pass in one array, which the UDAF copies to its output, which is
> then written to the output repeated vector. Since values must arrive in
> Holders, it is not clear how this would be done for arrays. Perhaps there
> is an annotation that lets us use some form of complex writer for arrays as
> is done for MAPs? Again, sorry, I didn't have time to learn that bit. Would
> be great to understand that so we can add it to the write-up.
>
> This chain mentions a MAP type. Drill also includes other complex types:
> REPEATED MAP, (non-repeated) LIST, (repeated) LIST, and UNION. It is not at
> all clear how UDAFs work for these types.
>
> One other thing to consider: ANY_VALUE can never work for the hash agg
> because output values are updated in random order. It can only ever work
> for streaming agg because the streaming agg only appends output values.
> Fortunately, this chain is about the streaming agg. For Hash Agg,
> intermediate variable-width values are stored in an Object Vector, but
> those values won't survive serialization. As a result, only fixed-width
> types can be updated in random order. DRILL-6087 describes this issue.
>
> Thanks,
> - Paul
>
> [1] https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information
>
>
>
>
>
>
>     On Wednesday, April 11, 2018, 4:09:47 PM PDT, Aman Sinha <
> amansi...@apache.org> wrote:
>
>  Here's some background on what Gautam is trying to do:  Currently, SQL
> does
> not have a standard way to do a DISTINCT on a subset of the columns in the
> SELECT list.  Suppose there are 2 columns:
>   a:  INTEGER
>   b:  MAP
> Suppose I want to only do DISTINCT on 'a' and I don't really care about the
> column 'b' .. I just want the first or any value of 'b' within a single
> group of 'a'.    Postgres actually has a 'DISTINCT ON(a), b' syntax but
> based on our discussion on the Calcite mailing list, we want to avoid that
> syntax.  So, there's an alternative proposal to do the following:
>
>     SELECT a, ANY_VALUE(b) FROM table GROUP BY a
>
> This means, ANY_VALUE will essentially be treated as an Aggregate function
> and from a code-gen perspective, we want to read 1 item (a MapHolder) from
> the incoming MapVector and write it to a particular index in the output
> MapVector.    This is where  it would be useful to  have
> MapVector.setSafe()  since the StreamingAgg and HashAgg both generate
> setSafe()  for normal aggregate functions.
>
> However, it seems the better (or perhaps only) way to do this is through
> the MapOrListWriter (ComplexWriter) as long as there's a way to instruct
> the writer to write to a specific output index (the output index is needed
> because there are several groups in the output container and we want to
> write to a specific one).
>
> -Aman
>
>
> On Wed, Apr 11, 2018 at 2:13 PM, Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > What semantics are wanted? SetSafe sets a single value in a vector. What
> > does it mean to set a single map or array value? What would we pass as an
> > argument?
> > For non-simple types, something needs to iterate over the values: be they
> > elements of a map, elements in an array, elements of an array of maps,
> then
> > over the map members, etc.
> > I believe that you are hitting a fundamental difference between simple
> > scale values and complex (composite) values.
> > This is for an aggregate. There is no meaningful aggregate of a map or an
> > array. Once could aggregate over a scalar that is a member of a map to
> > produce a scalar result. Or, one could iterate over the members of an
> array
> > to produce, say, an average or sum.
> > You are dealing with aggregate UDFs (even built in functions implement
> the
> > UDAF protocol.) A quick check of the source code does not find a
> > "AnyValueComplexFunctions" class, so this may perhaps be something new
> you
> > are developing. What are the desired semantics?
> > The UDAF protocol can include a complex writer for maps. I've not played
> > with that yet. But, it does not seem meaningful to aggregate a map to
> > produce a map or to aggregate an array to produce an array. The idea is
> > that the UDAF figures out what to do with maps, then uses the complex
> > writer to produce the desired result. This makes sense since there is no
> > way to store a map as a simple value passed to setSafe().
> > Can you provide additional details of what you are trying to do?
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Wednesday, April 11, 2018, 1:53:12 PM PDT, Padma Penumarthy <
> > ppenumar...@mapr.com> wrote:
> >
> >  I guess you can add a setSafe method which recursively does setSafe for
> > all children.
> >
> > Thanks
> > Padma
> >
> >
> > > On Apr 11, 2018, at 1:19 PM, Gautam Parai <gpa...@mapr.com> wrote:
> > >
> > > Hi Paul/Padma,
> > >
> > >
> > > Thank you so much for the responses. This function is supposed to
> return
> > `any value` from the batch of incoming rows. Hence, the need to handle
> > maps/lists.
> > >
> > >
> > > This codegen is for the StreamingAggregator for Complex type(e.g. maps)
> > in the incoming batch. It is trying to assign the values in the
> > ComplexHolder to the outgoing MapVector.
> > >
> > >
> > > MapVector vv9; // Output VV of StreamingAgg
> > >
> > > ....
> > >
> > >
> > >    public void outputRecordValues(int outIndex)
> > >
> > >        throws SchemaChangeException
> > >
> > >    {
> > >
> > >        {
> > >
> > >            ComplexHolder out8;
> > >
> > >            {
> > >
> > >                final ComplexHolder out = new ComplexHolder();
> > >
> > >                FieldReader fr = work0;
> > >
> > >                MapHolder value = work1;
> > >
> > >                BigIntHolder nonNullCount = work2;
> > >
> > >
> > >
> > > AnyValueComplexFunctions$MapAnyValue_output: {
> > >
> > >    out.reader = fr;
> > >
> > > }
> > >
> > >
> > >
> > >                work0 = fr;
> > >
> > >                work1 = value;
> > >
> > >                work2 = nonNullCount;
> > >
> > >                out8 = out;
> > >
> > >            }
> > >
> > >            vv9 .getMutator().setSafe((outIndex), out8); //Don't have
> > setSafe for MapVector
> > >
> > >        }
> > >
> > >    }
> > >
> > >
> > > Please let me know your thoughts.
> > >
> > >
> > > Gautam
> > >
> > >
> > >
> > > ________________________________
> > > From: Paul Rogers <par0...@yahoo.com.INVALID>
> > > Sent: Wednesday, April 11, 2018 12:40:15 PM
> > > To: dev@drill.apache.org
> > > Subject: Re: [DISCUSS] Regarding mutator interface
> > >
> > > Note that, for maps and lists, there is nothing to set. Maps are purely
> > containers for other vectors. Lists (you didn't mention whether
> "repeated"
> > or "non-repeated") are also containers. Non-repeated lists are containers
> > for unions, repeated-lists are containers for arrays.
> > > Any setting should be done on the contained vectors. For lists, only
> the
> > offset vector is updated.
> > > So, another question is: what is the generated code trying to set?
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > >
> > >    On Wednesday, April 11, 2018, 12:33:52 PM PDT, Padma Penumarthy <
> > ppenumar...@mapr.com> wrote:
> > >
> > > Can you explain how aggregation on complex type works (or supposed to
> > work).
> > >
> > > Thanks
> > > Padma
> > >
> > >
> > >> On Apr 11, 2018, at 12:15 PM, Gautam Parai <gpa...@mapr.com> wrote:
> > >>
> > >> Hi all,
> > >>
> > >>
> > >> I am implementing a new aggregate function which also handles Complex
> > types (map and list). However, the codegen barfs with
> > >>
> > >>
> > >> CompileException: Line 104, Column 39: A method named "setSafe" is not
> > declared in any enclosing class nor any supertype, nor through a static
> > import
> > >>
> > >>
> > >> It looks like we do not have set()/ setSafe() methods for
> > MapVector/ListVector mutators.
> > >>
> > >>
> > >> Should we add these methods to the Mutator interface to ensure all
> > mutators implement them? Is these a reason we chose not to do so?
> > >>
> > >>
> > >> Please let me know your thoughts. Thanks!
> > >>
> > >>
> > >> Gautam
> > >
> >
> >
>
>

Reply via email to