Hi Francois, I haven't read the details of your use case but just want to make sure you have looked at the nested data functions and ruled them out for your requirement: https://drill.apache.org/docs/nested-data-functions/
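As a hedged sketch of what is on that page (untested against your data, and assuming field4 is read as a map of key/value pairs), KVGEN plus FLATTEN turns a map column into one row per key/value pair, which is often enough to group on nested data without a custom UDF:

```sql
select t.kv.`key` as k, t.kv.`value` as v
from (select flatten(kvgen(tb1.field4)) as kv
      from dfs.`doc.json` tb1) t;
```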
-Aman

On Thu, Aug 31, 2017 at 8:23 AM, François Méthot <[email protected]> wrote:

> I managed to implement a single UDF that returns a copy of a MapHolder
> input var; it allowed me to figure out how to use a SingleMapReaderImpl
> input and a ComplexWriter as output.
>
> I tried to move that approach into an aggregation function that looks like
> the snippet below. I want to return the first MapHolder value encountered
> in a group-by operation.
>
> select firstvalue(tb1.field1), firstvalue(tb1.field2),
>        firstvalue(tb1.field3), firstvalue(tb1.field4)
> from dfs.`doc.json` tb1
> group by tb1.field4.key_data
>
> I get:
>
> Error: SYSTEM ERROR: UnsupportedOperationException: Unable to get new
> vector for minor type [LATE] and mode [REQUIRED]
>
> I am not sure if we can use the "out" variable within the add() method.
> Any hint from the experts to put me on track would be appreciated.
>
> Thanks
> Francois
>
> @FunctionTemplate(name = "firstvalue",
>     scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
> public static class FirstValueComplex implements DrillAggFunc
> {
>   @Param
>   MapHolder map;
>
>   @Workspace
>   BitHolder firstSeen;
>
>   @Output
>   ComplexWriter out;
>
>   @Override
>   public void setup()
>   {
>     firstSeen.value = 0;
>   }
>
>   @Override
>   public void add()
>   {
>     if (firstSeen.value == 0)
>     {
>       org.apache.drill.exec.vector.complex.impl.SingleMapReaderImpl mapReader =
>           (org.apache.drill.exec.vector.complex.impl.SingleMapReaderImpl) (Object) map;
>       mapReader.copyAsValue(out.rootAsMap());
>       firstSeen.value = 1;
>     }
>   }
>
>   @Override
>   public void output()
>   {
>   }
>
>   @Override
>   public void reset()
>   {
>     out.clear();
>     firstSeen.value = 0;
>   }
> }
>
> On 30 August 2017 at 16:57, François Méthot <[email protected]> wrote:
> >
> > Hi,
> >
> > Congrats on the 1.11 release; we are happy to have our suggestion
> > implemented in the new release (automatic HDFS block size for parquet
> > files).
> >
> > It seems we are pushing the limits of Drill with this new type of
> > query... (I am learning new SQL tricks in the process.)
> >
> > We are trying to aggregate a json document based on a nested value.
> >
> > The document looks like this:
> >
> > {
> >   "field1" : {
> >     "f1_a" : "infoa",
> >     "f1_b" : "infob"
> >   },
> >   "field2" : "very long string",
> >   "field3" : {
> >     "f3_a" : "infoc",
> >     "f3_b" : "infod",
> >     "f4_c" : {
> >       ....
> >     }
> >   },
> >   "field4" : {
> >     "key_data" : "String to aggregate on",
> >     "f4_b" : "a string2",
> >     "f4_c" : {
> >       .... complex structure...
> >     }
> >   }
> > }
> >
> > We want the first, last (or any) occurrence of field1, field2, field3
> > and field4, grouped by field4.key_data.
> >
> > Unfortunately the min and max functions do not support json complex
> > columns (MapHolder), therefore group-by queries do not work.
> >
> > We tried a window function like this:
> >
> > create table .... as (
> >   select first_value(tb1.field1) over (partition by tb1.field4.key_data) as field1,
> >          first_value(tb1.field2) over (partition by tb1.field4.key_data) as field2,
> >          first_value(tb1.field3) over (partition by tb1.field4.key_data) as field3,
> >          first_value(tb1.field4) over (partition by tb1.field4.key_data) as field4
> >   from dfs.`doc.json` tb1
> > )
> >
> > We get IndexOutOfBoundsException.
> >
> > We had better success with:
> >
> > create table .... as (
> >   select * from
> >     (select tb1.*,
> >             row_number() over (partition by tb1.field4.key_data) as row_num
> >      from dfs.`doc.json` tb1
> >     ) t
> >   where t.row_num = 1
> > )
> >
> > This works on a single json file, or with multiple files in a session
> > configured with planner.width_max_per_node=1.
> >
> > As soon as we use more than 1 thread per query, we get
> > IndexOutOfBoundsException. This was tried on 1.10 and 1.11.
> > It looks like a bug.
> >
> > Would you have other suggestions to bypass that issue?
> > Is there an existing aggregation function (to work with group by) that
> > would return the first, last, or a random MapHolder column from a json
> > document? If not, I am thinking of implementing one; would there be an
> > example of how to clone a MapHolder within a function? (I am pretty sure
> > I can't assign the "in" param to the output within a function.)
> >
> > Thank you for your time reading this.
> > Any suggestions to try are welcome.
> >
> > Francois
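Independent of the Drill UDF machinery, the semantics being asked for (keep the first record seen for each value of a grouping key) can be sketched in plain Java. This is only an illustration of the intended behavior, not Drill code; the record and key names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FirstValuePerGroup {

    // Keep the first record seen for each grouping key, preserving the
    // order in which keys are first encountered. Later duplicates of a
    // key are ignored, mirroring a first-value aggregate under group by.
    static Map<String, Map<String, Object>> firstValueByKey(
            List<Map<String, Object>> records, String keyField) {
        Map<String, Map<String, Object>> firstSeen = new LinkedHashMap<>();
        for (Map<String, Object> rec : records) {
            String key = String.valueOf(rec.get(keyField));
            firstSeen.putIfAbsent(key, rec);  // no-op if key already present
        }
        return firstSeen;
    }

    public static void main(String[] args) {
        // Hypothetical documents, keyed on "key_data" as in the thread.
        List<Map<String, Object>> docs = List.of(
            Map.of("key_data", "a", "field2", "first-a"),
            Map.of("key_data", "b", "field2", "first-b"),
            Map.of("key_data", "a", "field2", "second-a"));
        Map<String, Map<String, Object>> result = firstValueByKey(docs, "key_data");
        System.out.println(result.get("a").get("field2")); // prints first-a
        System.out.println(result.size());                 // prints 2
    }
}
```

The `row_number() ... where row_num = 1` workaround above computes the same thing: one arbitrary-but-first row per partition key.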
