Re: Re: UNNEST() operation for arrays

Gabor Kaszab Tue, 28 Sep 2021 10:08:44 -0700

Thanks for sharing your view, Tim! And sorry for the late answer.

*Feature flag:*
The more thought I give this the more I feel that using a feature flag
indeed would be confusing. Additionally, I like the idea of having a
special syntax for the zipping unnest.


*Syntax:*
I checked the SQL standard and it says something really similar to your
suggestion:
Example from the standard:
SELECT M.E FROM UNNEST (MVE) AS M(E)

<collection derived table> ::=
  UNNEST <left paren> <collection value expression>
[ { <comma> <collection value expression> }... ] <right paren> [ WITH
ORDINALITY ]

And you can specify aliases for the <collection derived table>s with the AS
keyword and even the standard mentions multiple collection parameters for
the unnest(). So let's have the query rewrite this way:
SELECT UNNEST(t.a1), UNNEST(t.a2) FROM t;
==>
SELECT _tmp1, _tmp2 FROM t, UNNEST(t.a1, t.a2) AS (_tmp1, tmp2);

*Arrays from multiple tables:*
I checked in Postgres and apparently you can unnest arrays from multiple
tables and they still get zipped. I think for the first iteration this
would be overkill to implement as Impala puts the arrays from different
tables into a different subplan making it a bit more complex to implement,
so I'd just restrict unnest() to a single table at this point.

*UnnestTableRef:*
I also thought that these unnests in the FROM clause should go into a kind
of TableRef (e.g. UnnestTableRef?, ZippingUnnestTableRef) but there I found
that each TableRef has a single rawPath_ and this might not be the best for
a single TableRef descendant holding multiple arrays with multiple
rawPath_s. But this is not a big issue, I'm sure there is another way to
get around this.

Cheers,
Gabor


On Tue, Sep 14, 2021 at 7:36 AM Tim Armstrong <tim.g.armstr...@gmail.com>
wrote:

> I think compatibility with Postgres derivatives generally makes sense as a
> goal, even if the behaviour isn't ideal - definitely much easier than
> rewriting everything to a different syntax.
>
> With your rewrite proposal, would you also be able to express these queries
> by unnesting in the FROM clause?
>
> SELECT unnest(t.a) c1, unnest(t.b) c1
> FROM tbl t
> =>
> SELECT _tmp1 c1, _tmp2 c2
> FROM tbl t, UNNEST(t.a, t.b) (_tmp1, _tmp2)
>
> I'm not sure a feature flag would work well for this, since you
> might potentially have a more complex FROM clause with other unnesting and
> I'm not sure how you would model that. I suspect it works out cleaner to
> model the zipped arrays as a TableRef - for most purposes in analysis and
> planning it's going to be treated the same as a TableRef that is just a
> plain array of structs - the analysis/planner code for TableRefs, joins,
> etc isn't the easiest to extend.
>
> I guess you could also rewrite it to something like this too, which would
> really let you isolate the special handling to a simple SELECT block:
> SELECT v._tmp1 c1, v._tmp2 c2
> FROM tbl t, (SELECT t.a __tmp1, t.b __tmp2 FROM t.a ZIP JOIN t.b) v
>
> Another question: what happens if the two arrays are from different table
> references? Is that allowed? Do they still get zipped together?
>
> On Mon, 13 Sept 2021 at 04:43, Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
> > Hey All,
> > Thanks for taking a look and sharing your view. The whole idea of having
> > multiple UNNEST() functions in the select list (and zipping the results)
> > came from a user as a direct feature request as they try to move to
> Impala
> > from Greenplum and as they say they can't rewrite their existing queries.
> >
> > The most straightforward way would be to implement UNNEST() as a UDTF but
> > Impala doesn't support them, and anyway, as mentioned above Impala
> doesn't
> > support even UDFs with nested type parameters.
> >
> > My idea was something like introducing an unnest SlotRef or such, change
> > the SQL parser to translate each UNNEST(array) into this SlotRef and then
> > do a query rewrite based on that so that we can process the unnest. My
> > preference would be to rewrite the query to the syntax we have now with
> > joining the arrays in the FROM clause as collection table refs and then
> > have a feature flag (e.g. unnest_join_type) to indicate whether we want
> the
> > joined or the zipped behaviour. For the zipped one we can modify the
> > existing UNNEST node to be able to receive more arrays and then do the
> zip
> > itself.
> >
> > What do you think?
> >
> > Gabor
> >
> > On Mon, Sep 13, 2021 at 9:35 AM 彭斗 <pengdou1...@126.com> wrote:
> >
> > > Hi  All,
> > >
> > >
> > > Currently IMPALA don't support complex data type as input/output
> > > parameters for Bulltin Functions / User Defined Functions,
> > > If IMPALA can support this, the Unnest()  or other  operation on
> complex
> > > data type may easy to achieve.
> > >
> > >
> > >
> > >
> > > --
> > >
> > > Best Regards
> > > Dou
> > >
> > >
> > >
> > >
> > >
> > > At 2021-09-13 15:08:52, "Quanlong Huang" <huangquanl...@gmail.com>
> > wrote:
> > > >This seems like the explode() UDTF in Hive:
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)
> > > >
> > > >Hive doesn't support multiple UDTF in the SelectList. I think we can
> > make
> > > >the same choice that only supports one UNNEST() in the SelectList. If
> > > >multiple columns need to be unfolded, they should be wrapped in the
> same
> > > >UNNEST() function with a keyword/func specifying how to join the rows.
> > > E.g.
> > > >
> > > >-- cross join by default
> > > >select UNNEST(array_a, array_b) from tbl;
> > > >-- specifying the join type
> > > >select UNNEST(zip array_a, array_b) from tbl;
> > > >select UNNEST(arrays_zip(array_a, array_b)) from tbl;
> > > >
> > > >explode() in Hive only supports one column. The above syntax is
> inspired
> > > by
> > > >SparkSQL:
> > > >
> > >
> >
> https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table
> > > >https://spark.apache.org/docs/latest/api/sql/index.html#arrays_zip
> > > >
> > > >On a separate note, what about using the same name(explode) as Hive? I
> > > >think it'd be helpful for migrations into Impala, especially if we
> also
> > > >support the LATERAL VIEW syntax.
> > > >
> > > >Thanks,
> > > >Quanlong
> > > >
> > > >On Sun, Sep 12, 2021 at 8:56 AM Tim Armstrong <
> > tim.g.armstr...@gmail.com>
> > > >wrote:
> > > >
> > > >> The syntax/behaviour seems kinda unusual to me - do any other
> systems
> > > aside
> > > >> from postgres implement this? UNNEST() in the from clause seems more
> > > >> common, and it is generally not consistent with the rest of SQL for
> > > SELECT
> > > >> clause entries to affect the set of rows returned.
> > > >>
> > > >> I'm not sure but it seems like it might be tricky to make the select
> > > clause
> > > >> in the Impala frontend do this.
> > > >>
> > > >> Some way of doing this in FROM clause seems more intuitive and
> > > consistent
> > > >> with the rest of sql, e.g.
> > > >>
> > > >> select id, e1, e2
> > > >> from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2)
> > > >>
> > > >> I'm tempted to say that it should be called ZIP() since that's the
> > name
> > > of
> > > >> this operation in functional programming, but I don't think
> inventing
> > > new
> > > >> SQL keywords is really ideal.
> > > >>
> > > >> I'm also a little curious about the use case for this. I think
> ideally
> > > you
> > > >> would represent two arrays with correlated indices as an array of
> > > structs
> > > >> instead of two arrays, but I can appreciate that schema designs may
> > not
> > > >> always be fully normalised.
> > > >>
> > > >> On Fri, 10 Sept 2021 at 18:38, Aman Sinha <amansi...@gmail.com>
> > wrote:
> > > >>
> > > >> > Hi Gabor,
> > > >> > An implicit assumption in the proposed syntax is that there is
> some
> > > >> > association between an item at a particular position in the first
> > > array
> > > >> and
> > > >> > the same position in the second array.
> > > >> > However,  in general the arrays are not ordered in a semantic
> sense.
> > > e.g
> > > >> > car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.
> >  The
> > > >> > proposed UNNEST behavior will not produce any rows where {Honda,
> > > Accord}
> > > >> > appear together which would be unexpected.  Although one could
> make
> > > the
> > > >> > argument that in order to get this behavior you could fall back to
> > the
> > > >> > current syntax of putting the arrays in the FROM clause, it could
> be
> > > >> > confusing to explain when one should be used vs other.  It would
> be
> > > >> useful
> > > >> > to see what other systems (apart from Postgres) which have more
> > native
> > > >> > complex types support do in this case (e.g Presto, Couchbase,
> > > Bigquery).
> > > >> >
> > > >> > One alternative I can think of is to make the association between
> > the
> > > >> items
> > > >> > in multiple arrays be more explicit e.g by using an outer join on
> > the
> > > >> > ordinal position. That way, items at the same position appear
> > together
> > > >> in a
> > > >> > row and for non-matches we produce Null for the smaller array. I
> > > haven't
> > > >> > thought through the exact syntax for this yet.
> > > >> >
> > > >> > -Aman
> > > >> >
> > > >> > On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <
> > > gaborkas...@cloudera.com>
> > > >> > wrote:
> > > >> >
> > > >> > > Hey,
> > > >> > >
> > > >> > > I'm working on implementing a new UNNEST() operation that does
> > what
> > > it
> > > >> > > says, unnests arrays. But here the trick would come when there
> are
> > > >> > multiple
> > > >> > > UNNESTS() in the select list. Let me explain the desired
> behaviour
> > > with
> > > >> > an
> > > >> > > example:
> > > >> > >
> > > >> > > Let's assume the following table:
> > > >> > > [image: Screenshot 2021-09-10 at 19.58.53.png]
> > > >> > > An the following query:
> > > >> > > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
> > > >> > >
> > > >> > > Here the expected result is to not produce a join on the two
> > arrays
> > > as
> > > >> we
> > > >> > > would with the original way of giving them in the select lit,
> but
> > to
> > > >> put
> > > >> > > the values of the arrays "next to each other" like this:
> > > >> > > [image: Screenshot 2021-09-10 at 19.59.04.png]
> > > >> > > I have investigated some options how this could be implemented
> and
> > > >> > created
> > > >> > > a doc for my notes. Let me share this with the community so that
> > > people
> > > >> > can
> > > >> > > share their preference. Any feedback is welcome!
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
> > > >> > >
> > > >> > > Cheers,
> > > >> > > Gabor
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
>

Re: Re: UNNEST() operation for arrays

Reply via email to