Re: Re: UNNEST() operation for arrays

Gabor Kaszab Mon, 13 Sep 2021 04:43:27 -0700

Hey All,
Thanks for taking a look and sharing your view. The whole idea of having
multiple UNNEST() functions in the select list (and zipping the results)
came from a user as a direct feature request as they try to move to Impala
from Greenplum and as they say they can't rewrite their existing queries.


The most straightforward way would be to implement UNNEST() as a UDTF but
Impala doesn't support them, and anyway, as mentioned above Impala doesn't
support even UDFs with nested type parameters.

My idea was something like introducing an unnest SlotRef or such, change
the SQL parser to translate each UNNEST(array) into this SlotRef and then
do a query rewrite based on that so that we can process the unnest. My
preference would be to rewrite the query to the syntax we have now with
joining the arrays in the FROM clause as collection table refs and then
have a feature flag (e.g. unnest_join_type) to indicate whether we want the
joined or the zipped behaviour. For the zipped one we can modify the
existing UNNEST node to be able to receive more arrays and then do the zip
itself.

What do you think?

Gabor

On Mon, Sep 13, 2021 at 9:35 AM 彭斗 <pengdou1...@126.com> wrote:

> Hi  All,
>
>
> Currently IMPALA don't support complex data type as input/output
> parameters for Bulltin Functions / User Defined Functions,
> If IMPALA can support this, the Unnest()  or other  operation on  complex
> data type may easy to achieve.
>
>
>
>
> --
>
> Best Regards
> Dou
>
>
>
>
>
> At 2021-09-13 15:08:52, "Quanlong Huang" <huangquanl...@gmail.com> wrote:
> >This seems like the explode() UDTF in Hive:
> >
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)
> >
> >Hive doesn't support multiple UDTF in the SelectList. I think we can make
> >the same choice that only supports one UNNEST() in the SelectList. If
> >multiple columns need to be unfolded, they should be wrapped in the same
> >UNNEST() function with a keyword/func specifying how to join the rows.
> E.g.
> >
> >-- cross join by default
> >select UNNEST(array_a, array_b) from tbl;
> >-- specifying the join type
> >select UNNEST(zip array_a, array_b) from tbl;
> >select UNNEST(arrays_zip(array_a, array_b)) from tbl;
> >
> >explode() in Hive only supports one column. The above syntax is inspired
> by
> >SparkSQL:
> >
> https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table
> >https://spark.apache.org/docs/latest/api/sql/index.html#arrays_zip
> >
> >On a separate note, what about using the same name(explode) as Hive? I
> >think it'd be helpful for migrations into Impala, especially if we also
> >support the LATERAL VIEW syntax.
> >
> >Thanks,
> >Quanlong
> >
> >On Sun, Sep 12, 2021 at 8:56 AM Tim Armstrong <tim.g.armstr...@gmail.com>
> >wrote:
> >
> >> The syntax/behaviour seems kinda unusual to me - do any other systems
> aside
> >> from postgres implement this? UNNEST() in the from clause seems more
> >> common, and it is generally not consistent with the rest of SQL for
> SELECT
> >> clause entries to affect the set of rows returned.
> >>
> >> I'm not sure but it seems like it might be tricky to make the select
> clause
> >> in the Impala frontend do this.
> >>
> >> Some way of doing this in FROM clause seems more intuitive and
> consistent
> >> with the rest of sql, e.g.
> >>
> >> select id, e1, e2
> >> from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2)
> >>
> >> I'm tempted to say that it should be called ZIP() since that's the name
> of
> >> this operation in functional programming, but I don't think inventing
> new
> >> SQL keywords is really ideal.
> >>
> >> I'm also a little curious about the use case for this. I think ideally
> you
> >> would represent two arrays with correlated indices as an array of
> structs
> >> instead of two arrays, but I can appreciate that schema designs may not
> >> always be fully normalised.
> >>
> >> On Fri, 10 Sept 2021 at 18:38, Aman Sinha <amansi...@gmail.com> wrote:
> >>
> >> > Hi Gabor,
> >> > An implicit assumption in the proposed syntax is that there is some
> >> > association between an item at a particular position in the first
> array
> >> and
> >> > the same position in the second array.
> >> > However,  in general the arrays are not ordered in a semantic sense.
> e.g
> >> > car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.   The
> >> > proposed UNNEST behavior will not produce any rows where {Honda,
> Accord}
> >> > appear together which would be unexpected.  Although one could make
> the
> >> > argument that in order to get this behavior you could fall back to the
> >> > current syntax of putting the arrays in the FROM clause, it could be
> >> > confusing to explain when one should be used vs other.  It would be
> >> useful
> >> > to see what other systems (apart from Postgres) which have more native
> >> > complex types support do in this case (e.g Presto, Couchbase,
> Bigquery).
> >> >
> >> > One alternative I can think of is to make the association between the
> >> items
> >> > in multiple arrays be more explicit e.g by using an outer join on the
> >> > ordinal position. That way, items at the same position appear together
> >> in a
> >> > row and for non-matches we produce Null for the smaller array. I
> haven't
> >> > thought through the exact syntax for this yet.
> >> >
> >> > -Aman
> >> >
> >> > On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <
> gaborkas...@cloudera.com>
> >> > wrote:
> >> >
> >> > > Hey,
> >> > >
> >> > > I'm working on implementing a new UNNEST() operation that does what
> it
> >> > > says, unnests arrays. But here the trick would come when there are
> >> > multiple
> >> > > UNNESTS() in the select list. Let me explain the desired behaviour
> with
> >> > an
> >> > > example:
> >> > >
> >> > > Let's assume the following table:
> >> > > [image: Screenshot 2021-09-10 at 19.58.53.png]
> >> > > An the following query:
> >> > > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
> >> > >
> >> > > Here the expected result is to not produce a join on the two arrays
> as
> >> we
> >> > > would with the original way of giving them in the select lit, but to
> >> put
> >> > > the values of the arrays "next to each other" like this:
> >> > > [image: Screenshot 2021-09-10 at 19.59.04.png]
> >> > > I have investigated some options how this could be implemented and
> >> > created
> >> > > a doc for my notes. Let me share this with the community so that
> people
> >> > can
> >> > > share their preference. Any feedback is welcome!
> >> > >
> >> > >
> >> > >
> >> >
> >>
> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
> >> > >
> >> > > Cheers,
> >> > > Gabor
> >> > >
> >> > >
> >> > >
> >> >
> >>
>

Re: Re: UNNEST() operation for arrays

Reply via email to