Thanks for responding in detail, I agree with your points and it's interesting that the SQL standard has something similar - I didn't know that.
> I checked in Postgres and apparently you can unnest arrays from multiple > tables and they still get zipped. I think for the first iteration this > would be overkill to implement as Impala puts the arrays from different > tables into a different subplan making it a bit more complex to implement, > so I'd just restrict unnest() to a single table at this point. Agreed! I'm not sure why you would want to do this - maybe an unusual schema design. On Tue, 28 Sept 2021 at 06:07, Gabor Kaszab <gaborkas...@cloudera.com> wrote: > Thanks for sharing your view, Tim! And sorry for the late answer. > > *Feature flag:* > The more thought I give this the more I feel that using a feature flag > indeed would be confusing. Additionally, I like the idea of having a > special syntax for the zipping unnest. > > *Syntax:* > I checked the SQL standard and it says something really similar to your > suggestion: > Example from the standard: > SELECT M.E FROM UNNEST (MVE) AS M(E) > > <collection derived table> ::= > UNNEST <left paren> <collection value expression> > [ { <comma> <collection value expression> }... ] <right paren> [ WITH > ORDINALITY ] > > And you can specify aliases for the <collection derived table>s with the AS > keyword and even the standard mentions multiple collection parameters for > the unnest(). So let's have the query rewrite this way: > SELECT UNNEST(t.a1), UNNEST(t.a2) FROM t; > ==> > SELECT _tmp1, _tmp2 FROM t, UNNEST(t.a1, t.a2) AS (_tmp1, tmp2); > > *Arrays from multiple tables:* > I checked in Postgres and apparently you can unnest arrays from multiple > tables and they still get zipped. I think for the first iteration this > would be overkill to implement as Impala puts the arrays from different > tables into a different subplan making it a bit more complex to implement, > so I'd just restrict unnest() to a single table at this point. > > *UnnestTableRef:* > I also thought that these unnests in the FROM clause should go into a kind > of TableRef (e.g. UnnestTableRef?, ZippingUnnestTableRef) but there I found > that each TableRef has a single rawPath_ and this might not be the best for > a single TableRef descendant holding multiple arrays with multiple > rawPath_s. But this is not a big issue, I'm sure there is another way to > get around this. > > Cheers, > Gabor > > > On Tue, Sep 14, 2021 at 7:36 AM Tim Armstrong <tim.g.armstr...@gmail.com> > wrote: > > > I think compatibility with Postgres derivatives generally makes sense as > a > > goal, even if the behaviour isn't ideal - definitely much easier than > > rewriting everything to a different syntax. > > > > With your rewrite proposal, would you also be able to express these > queries > > by unnesting in the FROM clause? > > > > SELECT unnest(t.a) c1, unnest(t.b) c1 > > FROM tbl t > > => > > SELECT _tmp1 c1, _tmp2 c2 > > FROM tbl t, UNNEST(t.a, t.b) (_tmp1, _tmp2) > > > > I'm not sure a feature flag would work well for this, since you > > might potentially have a more complex FROM clause with other unnesting > and > > I'm not sure how you would model that. I suspect it works out cleaner to > > model the zipped arrays as a TableRef - for most purposes in analysis and > > planning it's going to be treated the same as a TableRef that is just a > > plain array of structs - the analysis/planner code for TableRefs, joins, > > etc isn't the easiest to extend. > > > > I guess you could also rewrite it to something like this too, which would > > really let you isolate the special handling to a simple SELECT block: > > SELECT v._tmp1 c1, v._tmp2 c2 > > FROM tbl t, (SELECT t.a __tmp1, t.b __tmp2 FROM t.a ZIP JOIN t.b) v > > > > Another question: what happens if the two arrays are from different table > > references? Is that allowed? Do they still get zipped together? > > > > On Mon, 13 Sept 2021 at 04:43, Gabor Kaszab <gaborkas...@apache.org> > > wrote: > > > > > Hey All, > > > Thanks for taking a look and sharing your view. The whole idea of > having > > > multiple UNNEST() functions in the select list (and zipping the > results) > > > came from a user as a direct feature request as they try to move to > > Impala > > > from Greenplum and as they say they can't rewrite their existing > queries. > > > > > > The most straightforward way would be to implement UNNEST() as a UDTF > but > > > Impala doesn't support them, and anyway, as mentioned above Impala > > doesn't > > > support even UDFs with nested type parameters. > > > > > > My idea was something like introducing an unnest SlotRef or such, > change > > > the SQL parser to translate each UNNEST(array) into this SlotRef and > then > > > do a query rewrite based on that so that we can process the unnest. My > > > preference would be to rewrite the query to the syntax we have now with > > > joining the arrays in the FROM clause as collection table refs and then > > > have a feature flag (e.g. unnest_join_type) to indicate whether we want > > the > > > joined or the zipped behaviour. For the zipped one we can modify the > > > existing UNNEST node to be able to receive more arrays and then do the > > zip > > > itself. > > > > > > What do you think? > > > > > > Gabor > > > > > > On Mon, Sep 13, 2021 at 9:35 AM 彭斗 <pengdou1...@126.com> wrote: > > > > > > > Hi All, > > > > > > > > > > > > Currently IMPALA don't support complex data type as input/output > > > > parameters for Bulltin Functions / User Defined Functions, > > > > If IMPALA can support this, the Unnest() or other operation on > > complex > > > > data type may easy to achieve. > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Best Regards > > > > Dou > > > > > > > > > > > > > > > > > > > > > > > > At 2021-09-13 15:08:52, "Quanlong Huang" <huangquanl...@gmail.com> > > > wrote: > > > > >This seems like the explode() UDTF in Hive: > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array) > > > > > > > > > >Hive doesn't support multiple UDTF in the SelectList. I think we can > > > make > > > > >the same choice that only supports one UNNEST() in the SelectList. > If > > > > >multiple columns need to be unfolded, they should be wrapped in the > > same > > > > >UNNEST() function with a keyword/func specifying how to join the > rows. > > > > E.g. > > > > > > > > > >-- cross join by default > > > > >select UNNEST(array_a, array_b) from tbl; > > > > >-- specifying the join type > > > > >select UNNEST(zip array_a, array_b) from tbl; > > > > >select UNNEST(arrays_zip(array_a, array_b)) from tbl; > > > > > > > > > >explode() in Hive only supports one column. The above syntax is > > inspired > > > > by > > > > >SparkSQL: > > > > > > > > > > > > > > > https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table > > > > >https://spark.apache.org/docs/latest/api/sql/index.html#arrays_zip > > > > > > > > > >On a separate note, what about using the same name(explode) as > Hive? I > > > > >think it'd be helpful for migrations into Impala, especially if we > > also > > > > >support the LATERAL VIEW syntax. > > > > > > > > > >Thanks, > > > > >Quanlong > > > > > > > > > >On Sun, Sep 12, 2021 at 8:56 AM Tim Armstrong < > > > tim.g.armstr...@gmail.com> > > > > >wrote: > > > > > > > > > >> The syntax/behaviour seems kinda unusual to me - do any other > > systems > > > > aside > > > > >> from postgres implement this? UNNEST() in the from clause seems > more > > > > >> common, and it is generally not consistent with the rest of SQL > for > > > > SELECT > > > > >> clause entries to affect the set of rows returned. > > > > >> > > > > >> I'm not sure but it seems like it might be tricky to make the > select > > > > clause > > > > >> in the Impala frontend do this. > > > > >> > > > > >> Some way of doing this in FROM clause seems more intuitive and > > > > consistent > > > > >> with the rest of sql, e.g. > > > > >> > > > > >> select id, e1, e2 > > > > >> from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2) > > > > >> > > > > >> I'm tempted to say that it should be called ZIP() since that's the > > > name > > > > of > > > > >> this operation in functional programming, but I don't think > > inventing > > > > new > > > > >> SQL keywords is really ideal. > > > > >> > > > > >> I'm also a little curious about the use case for this. I think > > ideally > > > > you > > > > >> would represent two arrays with correlated indices as an array of > > > > structs > > > > >> instead of two arrays, but I can appreciate that schema designs > may > > > not > > > > >> always be fully normalised. > > > > >> > > > > >> On Fri, 10 Sept 2021 at 18:38, Aman Sinha <amansi...@gmail.com> > > > wrote: > > > > >> > > > > >> > Hi Gabor, > > > > >> > An implicit assumption in the proposed syntax is that there is > > some > > > > >> > association between an item at a particular position in the > first > > > > array > > > > >> and > > > > >> > the same position in the second array. > > > > >> > However, in general the arrays are not ordered in a semantic > > sense. > > > > e.g > > > > >> > car_make: {"Honda", "Toyota"}, car_model: {"Prius", "Accord"}. > > > The > > > > >> > proposed UNNEST behavior will not produce any rows where {Honda, > > > > Accord} > > > > >> > appear together which would be unexpected. Although one could > > make > > > > the > > > > >> > argument that in order to get this behavior you could fall back > to > > > the > > > > >> > current syntax of putting the arrays in the FROM clause, it > could > > be > > > > >> > confusing to explain when one should be used vs other. It would > > be > > > > >> useful > > > > >> > to see what other systems (apart from Postgres) which have more > > > native > > > > >> > complex types support do in this case (e.g Presto, Couchbase, > > > > Bigquery). > > > > >> > > > > > >> > One alternative I can think of is to make the association > between > > > the > > > > >> items > > > > >> > in multiple arrays be more explicit e.g by using an outer join > on > > > the > > > > >> > ordinal position. That way, items at the same position appear > > > together > > > > >> in a > > > > >> > row and for non-matches we produce Null for the smaller array. I > > > > haven't > > > > >> > thought through the exact syntax for this yet. > > > > >> > > > > > >> > -Aman > > > > >> > > > > > >> > On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab < > > > > gaborkas...@cloudera.com> > > > > >> > wrote: > > > > >> > > > > > >> > > Hey, > > > > >> > > > > > > >> > > I'm working on implementing a new UNNEST() operation that does > > > what > > > > it > > > > >> > > says, unnests arrays. But here the trick would come when there > > are > > > > >> > multiple > > > > >> > > UNNESTS() in the select list. Let me explain the desired > > behaviour > > > > with > > > > >> > an > > > > >> > > example: > > > > >> > > > > > > >> > > Let's assume the following table: > > > > >> > > [image: Screenshot 2021-09-10 at 19.58.53.png] > > > > >> > > An the following query: > > > > >> > > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl; > > > > >> > > > > > > >> > > Here the expected result is to not produce a join on the two > > > arrays > > > > as > > > > >> we > > > > >> > > would with the original way of giving them in the select lit, > > but > > > to > > > > >> put > > > > >> > > the values of the arrays "next to each other" like this: > > > > >> > > [image: Screenshot 2021-09-10 at 19.59.04.png] > > > > >> > > I have investigated some options how this could be implemented > > and > > > > >> > created > > > > >> > > a doc for my notes. Let me share this with the community so > that > > > > people > > > > >> > can > > > > >> > > share their preference. Any feedback is welcome! > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > >> > > > > > > > > > > https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing > > > > >> > > > > > > >> > > Cheers, > > > > >> > > Gabor > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > >> > > > > > > > > > >