Hi Francois,
I haven't read the details of your use case, but I just want to make sure you
have looked at the nested data functions and ruled them out for your
requirement: https://drill.apache.org/docs/nested-data-functions/

-Aman

On Thu, Aug 31, 2017 at 8:23 AM, François Méthot <[email protected]>
wrote:

> I managed to implement a single UDF that returns a copy of a MapHolder input
> var; it allowed me to figure out how to use a SingleMapReaderImpl input and
> ComplexWriter as output.
>
> I tried to move that approach into an aggregation function that looks like
> the snippet below.
> I want to return the first MapHolder value encountered in a group by
> operation.
>
> select firstvalue(tb1.field1), firstvalue(tb1.field2),
> firstvalue(tb1.field3), firstvalue(tb1.field4) from dfs.`doc.json` tb1
> group by tb1.field4.key_data
>
> I get:
> Error: SYSTEM ERROR: UnsupportedOperationException: Unable to get new
> vector for minor type [LATE] and mode [REQUIRED]
>
> I am not sure if we can use the "out" variable within the add method.
> Any hint from the experts to put me on track would be appreciated.
>
> Thanks
> Francois
>
>
> @FunctionTemplate(name = "firstvalue", scope =
> FunctionTemplate.FunctionScope.POINT_AGGREGATE)
> public static class FirstValueComplex implements DrillAggFunc
> {
>    @Param
>    MapHolder map;
>
>    @Workspace
>    BitHolder firstSeen;
>
>    @Output
>    ComplexWriter out;
>
>    @Override
>    public void setup()
>    {
>      firstSeen.value = 0;
>    }
>
>    @Override
>    public void add()
>    {
>      if (firstSeen.value == 0)
>      {
>        org.apache.drill.exec.vector.complex.impl.SingleMapReaderImpl mapReader =
>            (org.apache.drill.exec.vector.complex.impl.SingleMapReaderImpl) (Object) map;
>        mapReader.copyAsValue(out.rootAsMap());
>        firstSeen.value = 1;
>      }
>    }
>
>    @Override
>    public void output()
>    {
>    }
>
>    @Override
>    public void reset()
>    {
>      out.clear();
>      firstSeen.value = 0;
>    }
> }
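[Editor's note: the snippet above depends on Drill's runtime types. As a dependency-free illustration of the same first-value state machine (setup/add/output/reset lifecycle, with a "first seen" workspace flag), here is a plain-Java sketch; all class and method names are hypothetical, and a plain Map stands in for MapHolder/ComplexWriter.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FirstValueSketch {
    private boolean firstSeen;        // plays the role of the BitHolder workspace
    private Map<String, Object> out;  // plays the role of the ComplexWriter output

    public void setup() {             // note: lower-case setup(), per DrillAggFunc
        firstSeen = false;
        out = null;
    }

    // Called once per input row of the current group.
    public void add(Map<String, Object> map) {
        if (!firstSeen) {
            out = new LinkedHashMap<>(map);  // copy the first value encountered
            firstSeen = true;
        }
    }

    // Called once per group to emit the aggregate result.
    public Map<String, Object> output() {
        return out;
    }

    // Called between groups.
    public void reset() {
        out = null;
        firstSeen = false;
    }

    public static void main(String[] args) {
        FirstValueSketch agg = new FirstValueSketch();
        agg.setup();
        agg.add(Map.of("f1_a", "infoa"));
        agg.add(Map.of("f1_a", "later"));  // ignored: first value already captured
        System.out.println(agg.output().get("f1_a"));  // prints "infoa"
    }
}
```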
>
> On 30 August 2017 at 16:57, François Méthot <[email protected]> wrote:
>
> >
> > Hi,
> >
> > Congrats on the 1.11 release; we are happy to have our suggestion
> > implemented in the new release (automatic HDFS block size for parquet
> > files).
> >
> > It seems like we are pushing the limits of Drill with a new type of
> > query... (I am learning new SQL tricks in the process.)
> >
> > We are trying to aggregate a json document based on a nested value.
> >
> > Document looks like this:
> >
> > {
> >  "field1" : {
> >      "f1_a" : "infoa",
> >      "f1_b" : "infob"
> >   },
> >  "field2" : "very long string",
> >  "field3" : {
> >      "f3_a" : "infoc",
> >      "f3_b" : "infod",
> >      "f4_c" : {
> >           ....
> >       }
> >   },
> >   "field4" : {
> >      "key_data" : "String to aggregate on",
> >      "f4_b" : "a string2",
> >      "f4_c" : {
> >           .... complex structure...
> >       }
> >   }
> > }
> >
> >
> > We want the first, or last (or any), occurrence of field1, field2, field3
> > and field4, grouped by field4.key_data.
> >
> >
> > Unfortunately the min and max functions do not support json complex
> > columns (MapHolder). Therefore, group-by type queries do not work.
> >
> > We tried a window function like this
> > create table .... as (
> >   select first_value(tb1.field1) over (partition by tb1.field4.key_data)
> > as field1,
> >        first_value(tb1.field2) over (partition by tb1.field4.key_data) as
> > field2,
> >        first_value(tb1.field3) over (partition by tb1.field4.key_data) as
> > field3,
> >        first_value(tb1.field4) over (partition by tb1.field4.key_data) as
> > field4
> >   from dfs.`doc.json` tb1
> > )
> >
> > We get an IndexOutOfBoundsException.
> >
> > We got better success with:
> > create table .... as (
> >  select * from
> >   (select tb1.*,
> >           row_number() over (partition by tb1.field4.key_data) as row_num
> >    from  dfs.`doc.json` tb1
> >   ) t
> >  where t.row_num = 1
> > )
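[Editor's note: the row_number workaround above keeps exactly one (arbitrary) row per partition key. Outside of SQL, the same "first row per key" semantics reduce to a putIfAbsent over the key; a plain-Java sketch with hypothetical names:]

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FirstRowPerKey {
    // Keeps the first row encountered for each partition key, mirroring
    // "where row_num = 1" over row_number() partitioned by that key.
    public static List<Map<String, String>> firstPerKey(
            List<Map<String, String>> rows, String keyField) {
        Map<String, Map<String, String>> byKey = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            // later rows with an already-seen key are dropped
            byKey.putIfAbsent(row.get(keyField), row);
        }
        return List.copyOf(byKey.values());
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
            Map.of("key_data", "a", "field2", "x"),
            Map.of("key_data", "a", "field2", "y"),  // dropped: key "a" already seen
            Map.of("key_data", "b", "field2", "z"));
        System.out.println(firstPerKey(rows, "key_data").size());  // prints 2
    }
}
```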
> >
> > This works on a single json file, or with multiple files in a session
> > configured with planner.width_max_per_node=1.
> >
> > As soon as we use more than 1 thread per query, we get an
> > IndexOutOfBoundsException.
> > This was tried on 1.10 and 1.11.
> > It looks like a bug.
> >
> >
> > Would you have other suggestions to bypass that issue?
> > Is there an existing aggregation function (to work with group by) that
> > would return the first, last, or a random MapHolder column from a json
> > document?
> > If not, I am thinking of implementing one; is there an example of how
> > to clone a MapHolder within a function? (Pretty sure I can't assign the
> > "in" param to the output within a function.)
> >
> >
> > Thank you for your time reading this.
> > Any suggestions to try are welcome.
> >
> > Francois
> >
