Hi Shadi,
Unfortunately that isn't going to be a good strategy. We actually removed
the RecordBatch entirely from the UDF interfaces recently, to avoid
exposing that much internal state to UDFs. To do something like this, we
would want to define a new UDF interface.
One shortcoming that I believe is related to what you are trying to do is
the inability to treat the top-level schema of a Drill record the same way
we currently treat the complex map type. Drill supports passing non-scalar
values, in the form of maps and repeated types, into UDFs (these maps and
lists can be nested within one another to build nearly arbitrarily complex
data structures). The interface for passing in these structures is the
FieldReader, which works much like an iterator/visitor over a tree
structure. The two functions that use this interface today are
convertTo_JSON and kvgen (also called mappify). Both take a complex object
as input: convertTo_JSON produces a VarChar with the JSON representation,
and kvgen applies a transformation that makes the keys of a map queryable
as values (more information in the wiki link below [1]).
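To make the kvgen reshaping concrete, here is a small Python sketch of the transformation it applies to a map's entries. This is only an illustration of the data shape, not Drill's implementation (which operates on value vectors through the FieldReader):

```python
def kvgen(record_map):
    """Reshape a map's entries into a list of key/value records,
    mirroring the shape Drill's kvgen (mappify) produces for a map column."""
    return [{"key": k, "value": v} for k, v in record_map.items()]

# A map whose keys are really data, not schema:
counts = {"a": 1, "b": 2}
print(kvgen(counts))
# -> [{'key': 'a', 'value': 1}, {'key': 'b', 'value': 2}]
```

After this reshaping, the former keys can be filtered and aggregated like any other column values.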
The important thing to note is that these functions can only be invoked on
a particular field in the schema. It would make sense to allow them to be
invoked on the entire root schema, treating it like a map itself, possibly
with syntax like convertTo_JSON(*). (NOTE: this is not supported right now,
and hasn't even appeared in a design doc; it will not work today.)
For example, these two datasets:

flat schema:
----------------
{
  "a" : 1,
  "b" : 2
}

complex schema:
-----------------------
{
  "data" : {
    "a" : 1,
    "b" : 2
  }
}
With the first dataset, you can only access the individual data members
with syntax like: table_name.a
With the second, you can pass multiple fields into a function at once,
because the data is stored under a map at the root of the schema, e.g.
producing JSON in a varchar using: convertTo_JSON(data)
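If your data currently has the flat shape, a one-off preprocessing step can wrap each record under a single map field before it reaches Drill. A minimal Python sketch (the field name "data" is just the convention from the example above):

```python
import json

def wrap_record(record, field_name="data"):
    """Nest a flat record under a single map field, so all of its
    members can be handed to a function at once, e.g. convertTo_JSON(data)."""
    return {field_name: record}

flat = {"a": 1, "b": 2}
print(json.dumps(wrap_record(flat)))
# -> {"data": {"a": 1, "b": 2}}
```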
If you are willing to change the structure of your incoming data, I think
this might be a viable strategy for passing a variable number of arguments
into a function. It has the restriction today that any list you use must
hold a single data type. However, if there is a discrete number of possible
traits, you can use a map instead of a list, because fields nested within a
map can have different data types. That is, you cannot currently have a
mixed-type array like [1, true, "a string"], but you could put the values
in their own fields:

{ "a_number" : 1, "a_bool" : true, "a_str" : "a string" }

or keep a separate list for each type nested inside the map:

{ "list_numbers" : [1], "list_bools" : [true], "list_strings" : ["a string"] }
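As a sketch of that second workaround, here is how a preprocessing step could split a mixed-type list into per-type lists nested in a map, so that each Drill list stays homogeneous (plain Python; the field names are the hypothetical ones from the example above):

```python
def split_by_type(values):
    """Group a mixed-type list into homogeneous lists keyed by type,
    so each list holds a single data type as Drill requires."""
    out = {"list_numbers": [], "list_bools": [], "list_strings": []}
    for v in values:
        if isinstance(v, bool):            # check bool first: bool is an int subclass
            out["list_bools"].append(v)
        elif isinstance(v, (int, float)):
            out["list_numbers"].append(v)
        elif isinstance(v, str):
            out["list_strings"].append(v)
    return out

print(split_by_type([1, True, "a string"]))
# -> {'list_numbers': [1], 'list_bools': [True], 'list_strings': ['a string']}
```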
Since I've written this much already, I should say that this alternate
strategy currently only works if you change the source data; we do not
support re-nesting data within a query. Say you wanted to use an array to
pass a variable number of arguments, but the source data keeps the values
in separate fields. We currently *do not* support something like:

select field_1 as new_list[0], field_2 as new_list[1]

Again, as before, this hasn't even been fully discussed, so it will not
work today, and it doesn't represent a declaration of how this may work in
Drill in the future; it's just to demonstrate what we don't do today. If
such a feature existed, you could build the list in a subquery and pass it
to your function as the variable-length argument in an outer query. To do
something like this today, you have to modify the source data to put it in
this form.
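Since the re-nesting has to happen before the data reaches Drill, a one-off pass over the source files is the way to do it today. A minimal sketch (the field names field_1, field_2, and new_list are hypothetical, matching the example above):

```python
import json

def nest_fields(record, fields, list_name="new_list"):
    """Fold separate scalar fields into one array field: the reshaping
    that cannot currently be expressed in a Drill query."""
    nested = {list_name: [record[f] for f in fields]}
    # carry over any fields not folded into the list
    nested.update({k: v for k, v in record.items() if k not in fields})
    return nested

row = {"field_1": 0.5, "field_2": 0.7, "label": "spam"}
print(json.dumps(nest_fields(row, ["field_1", "field_2"])))
# -> {"new_list": [0.5, 0.7], "label": "spam"}
```

The resulting array field can then be passed into a UDF as a single repeated-type argument.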
To see how the FieldReader is used, check out this function definition in
the Drill source:
org.apache.drill.exec.expr.fn.impl.Mappify
Documentation on its usage in queries:
[1] https://cwiki.apache.org/confluence/display/DRILL/KVGEN+Function
On Tue, Mar 31, 2015 at 10:24 AM, Shadi Khalifa <[email protected]>
wrote:
> I wonder if I can extract this data from the RecordBatch? any ideas?
> Regards
> Shadi Khalifa
> PhD Candidate
> School of Computing, Queen's University, Canada
> I'm just a neuron in the society collective brain
>
> 01001001 00100000 01101100 01101111 01110110 01100101 00100000 01000101
> 01100111 01111001 01110000 01110100
> P Please consider your environmental responsibility before printing this
> e-mail
>
>
>
>
> On Tuesday, March 31, 2015 1:16 PM, Jacques Nadeau <
> [email protected]> wrote:
>
>
> It isn't yet supported but is something I think a lot of people would find
> useful. Depending on how ambitious you are, maybe you could pick it up?
>
> On Tue, Mar 31, 2015 at 10:05 AM, Shadi Khalifa <[email protected]>
> wrote:
>
> > Hello everyone,
> > I wonder if there is a way to send a variable number (Array) of
> attributes
> > (columns) to a custom user defined aggregate function.
> > I want to be able to have something like:
> > Select myAggrFn(col1, col2, ..., coln) from mytable;
> >
> > I wonder if there is something like the following, or anything else that
> > can handle this case:
> > @FunctionTemplate(name = "myAggrFn", scope =
> > FunctionTemplate.FunctionScope.POINT_AGGREGATE)
> > public static class MyAggrFn implements DrillAggFunc {
> >   @Param ObjectHolder[] in;
> >
> > I know it's weird to have a function like that, but I'm implementing
> > machine learning into Drill and need to pass some columns or maybe the
> > whole row to the aggregate function to train and use the model.
> > Regards
> > Shadi Khalifa
> > PhD Candidate
> > School of Computing, Queen's University, Canada
> >