Re: Re: UNNEST() operation for arrays

Tim Armstrong Thu, 30 Sep 2021 23:33:59 -0700

Thanks for responding in detail, I agree with your points and it's
interesting that the SQL standard has something similar - I didn't know
that.


> I checked in Postgres and apparently you can unnest arrays from multiple
> tables and they still get zipped. I think for the first iteration this
> would be overkill to implement as Impala puts the arrays from different
> tables into a different subplan making it a bit more complex to implement,
> so I'd just restrict unnest() to a single table at this point.
Agreed! I'm not sure why you would want to do this - maybe an unusual
schema design.

On Tue, 28 Sept 2021 at 06:07, Gabor Kaszab <gaborkas...@cloudera.com>
wrote:

> Thanks for sharing your view, Tim! And sorry for the late answer.
>
> *Feature flag:*
> The more thought I give this the more I feel that using a feature flag
> indeed would be confusing. Additionally, I like the idea of having a
> special syntax for the zipping unnest.
>
> *Syntax:*
> I checked the SQL standard and it says something really similar to your
> suggestion:
> Example from the standard:
> SELECT M.E FROM UNNEST (MVE) AS M(E)
>
> <collection derived table> ::=
>   UNNEST <left paren> <collection value expression>
> [ { <comma> <collection value expression> }... ] <right paren> [ WITH
> ORDINALITY ]
>
> And you can specify aliases for the <collection derived table>s with the AS
> keyword and even the standard mentions multiple collection parameters for
> the unnest(). So let's have the query rewrite this way:
> SELECT UNNEST(t.a1), UNNEST(t.a2) FROM t;
> ==>
> SELECT _tmp1, _tmp2 FROM t, UNNEST(t.a1, t.a2) AS (_tmp1, tmp2);
>
> *Arrays from multiple tables:*
> I checked in Postgres and apparently you can unnest arrays from multiple
> tables and they still get zipped. I think for the first iteration this
> would be overkill to implement as Impala puts the arrays from different
> tables into a different subplan making it a bit more complex to implement,
> so I'd just restrict unnest() to a single table at this point.
>
> *UnnestTableRef:*
> I also thought that these unnests in the FROM clause should go into a kind
> of TableRef (e.g. UnnestTableRef?, ZippingUnnestTableRef) but there I found
> that each TableRef has a single rawPath_ and this might not be the best for
> a single TableRef descendant holding multiple arrays with multiple
> rawPath_s. But this is not a big issue, I'm sure there is another way to
> get around this.
>
> Cheers,
> Gabor
>
>
> On Tue, Sep 14, 2021 at 7:36 AM Tim Armstrong <tim.g.armstr...@gmail.com>
> wrote:
>
> > I think compatibility with Postgres derivatives generally makes sense as
> a
> > goal, even if the behaviour isn't ideal - definitely much easier than
> > rewriting everything to a different syntax.
> >
> > With your rewrite proposal, would you also be able to express these
> queries
> > by unnesting in the FROM clause?
> >
> > SELECT unnest(t.a) c1, unnest(t.b) c1
> > FROM tbl t
> > =>
> > SELECT _tmp1 c1, _tmp2 c2
> > FROM tbl t, UNNEST(t.a, t.b) (_tmp1, _tmp2)
> >
> > I'm not sure a feature flag would work well for this, since you
> > might potentially have a more complex FROM clause with other unnesting
> and
> > I'm not sure how you would model that. I suspect it works out cleaner to
> > model the zipped arrays as a TableRef - for most purposes in analysis and
> > planning it's going to be treated the same as a TableRef that is just a
> > plain array of structs - the analysis/planner code for TableRefs, joins,
> > etc isn't the easiest to extend.
> >
> > I guess you could also rewrite it to something like this too, which would
> > really let you isolate the special handling to a simple SELECT block:
> > SELECT v._tmp1 c1, v._tmp2 c2
> > FROM tbl t, (SELECT t.a __tmp1, t.b __tmp2 FROM t.a ZIP JOIN t.b) v
> >
> > Another question: what happens if the two arrays are from different table
> > references? Is that allowed? Do they still get zipped together?
> >
> > On Mon, 13 Sept 2021 at 04:43, Gabor Kaszab <gaborkas...@apache.org>
> > wrote:
> >
> > > Hey All,
> > > Thanks for taking a look and sharing your view. The whole idea of
> having
> > > multiple UNNEST() functions in the select list (and zipping the
> results)
> > > came from a user as a direct feature request as they try to move to
> > Impala
> > > from Greenplum and as they say they can't rewrite their existing
> queries.
> > >
> > > The most straightforward way would be to implement UNNEST() as a UDTF
> but
> > > Impala doesn't support them, and anyway, as mentioned above Impala
> > doesn't
> > > support even UDFs with nested type parameters.
> > >
> > > My idea was something like introducing an unnest SlotRef or such,
> change
> > > the SQL parser to translate each UNNEST(array) into this SlotRef and
> then
> > > do a query rewrite based on that so that we can process the unnest. My
> > > preference would be to rewrite the query to the syntax we have now with
> > > joining the arrays in the FROM clause as collection table refs and then
> > > have a feature flag (e.g. unnest_join_type) to indicate whether we want
> > the
> > > joined or the zipped behaviour. For the zipped one we can modify the
> > > existing UNNEST node to be able to receive more arrays and then do the
> > zip
> > > itself.
> > >
> > > What do you think?
> > >
> > > Gabor
> > >
> > > On Mon, Sep 13, 2021 at 9:35 AM 彭斗 <pengdou1...@126.com> wrote:
> > >
> > > > Hi  All,
> > > >
> > > >
> > > > Currently IMPALA don't support complex data type as input/output
> > > > parameters for Bulltin Functions / User Defined Functions,
> > > > If IMPALA can support this, the Unnest()  or other  operation on
> > complex
> > > > data type may easy to achieve.
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Best Regards
> > > > Dou
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > At 2021-09-13 15:08:52, "Quanlong Huang" <huangquanl...@gmail.com>
> > > wrote:
> > > > >This seems like the explode() UDTF in Hive:
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)
> > > > >
> > > > >Hive doesn't support multiple UDTF in the SelectList. I think we can
> > > make
> > > > >the same choice that only supports one UNNEST() in the SelectList.
> If
> > > > >multiple columns need to be unfolded, they should be wrapped in the
> > same
> > > > >UNNEST() function with a keyword/func specifying how to join the
> rows.
> > > > E.g.
> > > > >
> > > > >-- cross join by default
> > > > >select UNNEST(array_a, array_b) from tbl;
> > > > >-- specifying the join type
> > > > >select UNNEST(zip array_a, array_b) from tbl;
> > > > >select UNNEST(arrays_zip(array_a, array_b)) from tbl;
> > > > >
> > > > >explode() in Hive only supports one column. The above syntax is
> > inspired
> > > > by
> > > > >SparkSQL:
> > > > >
> > > >
> > >
> >
> https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table
> > > > >https://spark.apache.org/docs/latest/api/sql/index.html#arrays_zip
> > > > >
> > > > >On a separate note, what about using the same name(explode) as
> Hive? I
> > > > >think it'd be helpful for migrations into Impala, especially if we
> > also
> > > > >support the LATERAL VIEW syntax.
> > > > >
> > > > >Thanks,
> > > > >Quanlong
> > > > >
> > > > >On Sun, Sep 12, 2021 at 8:56 AM Tim Armstrong <
> > > tim.g.armstr...@gmail.com>
> > > > >wrote:
> > > > >
> > > > >> The syntax/behaviour seems kinda unusual to me - do any other
> > systems
> > > > aside
> > > > >> from postgres implement this? UNNEST() in the from clause seems
> more
> > > > >> common, and it is generally not consistent with the rest of SQL
> for
> > > > SELECT
> > > > >> clause entries to affect the set of rows returned.
> > > > >>
> > > > >> I'm not sure but it seems like it might be tricky to make the
> select
> > > > clause
> > > > >> in the Impala frontend do this.
> > > > >>
> > > > >> Some way of doing this in FROM clause seems more intuitive and
> > > > consistent
> > > > >> with the rest of sql, e.g.
> > > > >>
> > > > >> select id, e1, e2
> > > > >> from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2)
> > > > >>
> > > > >> I'm tempted to say that it should be called ZIP() since that's the
> > > name
> > > > of
> > > > >> this operation in functional programming, but I don't think
> > inventing
> > > > new
> > > > >> SQL keywords is really ideal.
> > > > >>
> > > > >> I'm also a little curious about the use case for this. I think
> > ideally
> > > > you
> > > > >> would represent two arrays with correlated indices as an array of
> > > > structs
> > > > >> instead of two arrays, but I can appreciate that schema designs
> may
> > > not
> > > > >> always be fully normalised.
> > > > >>
> > > > >> On Fri, 10 Sept 2021 at 18:38, Aman Sinha <amansi...@gmail.com>
> > > wrote:
> > > > >>
> > > > >> > Hi Gabor,
> > > > >> > An implicit assumption in the proposed syntax is that there is
> > some
> > > > >> > association between an item at a particular position in the
> first
> > > > array
> > > > >> and
> > > > >> > the same position in the second array.
> > > > >> > However,  in general the arrays are not ordered in a semantic
> > sense.
> > > > e.g
> > > > >> > car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.
> > >  The
> > > > >> > proposed UNNEST behavior will not produce any rows where {Honda,
> > > > Accord}
> > > > >> > appear together which would be unexpected.  Although one could
> > make
> > > > the
> > > > >> > argument that in order to get this behavior you could fall back
> to
> > > the
> > > > >> > current syntax of putting the arrays in the FROM clause, it
> could
> > be
> > > > >> > confusing to explain when one should be used vs other.  It would
> > be
> > > > >> useful
> > > > >> > to see what other systems (apart from Postgres) which have more
> > > native
> > > > >> > complex types support do in this case (e.g Presto, Couchbase,
> > > > Bigquery).
> > > > >> >
> > > > >> > One alternative I can think of is to make the association
> between
> > > the
> > > > >> items
> > > > >> > in multiple arrays be more explicit e.g by using an outer join
> on
> > > the
> > > > >> > ordinal position. That way, items at the same position appear
> > > together
> > > > >> in a
> > > > >> > row and for non-matches we produce Null for the smaller array. I
> > > > haven't
> > > > >> > thought through the exact syntax for this yet.
> > > > >> >
> > > > >> > -Aman
> > > > >> >
> > > > >> > On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <
> > > > gaborkas...@cloudera.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Hey,
> > > > >> > >
> > > > >> > > I'm working on implementing a new UNNEST() operation that does
> > > what
> > > > it
> > > > >> > > says, unnests arrays. But here the trick would come when there
> > are
> > > > >> > multiple
> > > > >> > > UNNESTS() in the select list. Let me explain the desired
> > behaviour
> > > > with
> > > > >> > an
> > > > >> > > example:
> > > > >> > >
> > > > >> > > Let's assume the following table:
> > > > >> > > [image: Screenshot 2021-09-10 at 19.58.53.png]
> > > > >> > > An the following query:
> > > > >> > > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
> > > > >> > >
> > > > >> > > Here the expected result is to not produce a join on the two
> > > arrays
> > > > as
> > > > >> we
> > > > >> > > would with the original way of giving them in the select lit,
> > but
> > > to
> > > > >> put
> > > > >> > > the values of the arrays "next to each other" like this:
> > > > >> > > [image: Screenshot 2021-09-10 at 19.59.04.png]
> > > > >> > > I have investigated some options how this could be implemented
> > and
> > > > >> > created
> > > > >> > > a doc for my notes. Let me share this with the community so
> that
> > > > people
> > > > >> > can
> > > > >> > > share their preference. Any feedback is welcome!
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
> > > > >> > >
> > > > >> > > Cheers,
> > > > >> > > Gabor
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
>

Re: Re: UNNEST() operation for arrays

Reply via email to