Re:Re: UNNEST() operation for arrays

彭斗 Mon, 13 Sep 2021 00:35:28 -0700

Hi  All,


Currently IMPALA don't support complex data type as input/output parameters for 
Bulltin Functions / User Defined Functions,
If IMPALA can support this, the Unnest()  or other  operation on  complex data 
type may easy to achieve.




--

Best Regards
Dou





At 2021-09-13 15:08:52, "Quanlong Huang" <[email protected]> wrote:
>This seems like the explode() UDTF in Hive:
>https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode(array)
>
>Hive doesn't support multiple UDTF in the SelectList. I think we can make
>the same choice that only supports one UNNEST() in the SelectList. If
>multiple columns need to be unfolded, they should be wrapped in the same
>UNNEST() function with a keyword/func specifying how to join the rows. E.g.
>
>-- cross join by default
>select UNNEST(array_a, array_b) from tbl;
>-- specifying the join type
>select UNNEST(zip array_a, array_b) from tbl;
>select UNNEST(arrays_zip(array_a, array_b)) from tbl;
>
>explode() in Hive only supports one column. The above syntax is inspired by
>SparkSQL:
>https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table
>https://spark.apache.org/docs/latest/api/sql/index.html#arrays_zip
>
>On a separate note, what about using the same name(explode) as Hive? I
>think it'd be helpful for migrations into Impala, especially if we also
>support the LATERAL VIEW syntax.
>
>Thanks,
>Quanlong
>
>On Sun, Sep 12, 2021 at 8:56 AM Tim Armstrong <[email protected]>
>wrote:
>
>> The syntax/behaviour seems kinda unusual to me - do any other systems aside
>> from postgres implement this? UNNEST() in the from clause seems more
>> common, and it is generally not consistent with the rest of SQL for SELECT
>> clause entries to affect the set of rows returned.
>>
>> I'm not sure but it seems like it might be tricky to make the select clause
>> in the Impala frontend do this.
>>
>> Some way of doing this in FROM clause seems more intuitive and consistent
>> with the rest of sql, e.g.
>>
>> select id, e1, e2
>> from tbl t, UNNEST(t.arr1, t.arr2) (e1, e2)
>>
>> I'm tempted to say that it should be called ZIP() since that's the name of
>> this operation in functional programming, but I don't think inventing new
>> SQL keywords is really ideal.
>>
>> I'm also a little curious about the use case for this. I think ideally you
>> would represent two arrays with correlated indices as an array of structs
>> instead of two arrays, but I can appreciate that schema designs may not
>> always be fully normalised.
>>
>> On Fri, 10 Sept 2021 at 18:38, Aman Sinha <[email protected]> wrote:
>>
>> > Hi Gabor,
>> > An implicit assumption in the proposed syntax is that there is some
>> > association between an item at a particular position in the first array
>> and
>> > the same position in the second array.
>> > However,  in general the arrays are not ordered in a semantic sense.  e.g
>> > car_make: {"Honda", "Toyota"},   car_model: {"Prius", "Accord"}.   The
>> > proposed UNNEST behavior will not produce any rows where {Honda, Accord}
>> > appear together which would be unexpected.  Although one could make the
>> > argument that in order to get this behavior you could fall back to the
>> > current syntax of putting the arrays in the FROM clause, it could be
>> > confusing to explain when one should be used vs other.  It would be
>> useful
>> > to see what other systems (apart from Postgres) which have more native
>> > complex types support do in this case (e.g Presto, Couchbase, Bigquery).
>> >
>> > One alternative I can think of is to make the association between the
>> items
>> > in multiple arrays be more explicit e.g by using an outer join on the
>> > ordinal position. That way, items at the same position appear together
>> in a
>> > row and for non-matches we produce Null for the smaller array. I haven't
>> > thought through the exact syntax for this yet.
>> >
>> > -Aman
>> >
>> > On Fri, Sep 10, 2021 at 11:04 AM Gabor Kaszab <[email protected]>
>> > wrote:
>> >
>> > > Hey,
>> > >
>> > > I'm working on implementing a new UNNEST() operation that does what it
>> > > says, unnests arrays. But here the trick would come when there are
>> > multiple
>> > > UNNESTS() in the select list. Let me explain the desired behaviour with
>> > an
>> > > example:
>> > >
>> > > Let's assume the following table:
>> > > [image: Screenshot 2021-09-10 at 19.58.53.png]
>> > > An the following query:
>> > > SELECT id, UNNEST(arr1), UNNEST(arr2) FROM tbl;
>> > >
>> > > Here the expected result is to not produce a join on the two arrays as
>> we
>> > > would with the original way of giving them in the select lit, but to
>> put
>> > > the values of the arrays "next to each other" like this:
>> > > [image: Screenshot 2021-09-10 at 19.59.04.png]
>> > > I have investigated some options how this could be implemented and
>> > created
>> > > a doc for my notes. Let me share this with the community so that people
>> > can
>> > > share their preference. Any feedback is welcome!
>> > >
>> > >
>> > >
>> >
>> https://docs.google.com/document/d/184EKJwMME4SNzyfOTueI-nz-IL-WUiBeaS8Zhi2XzMo/edit?usp=sharing
>> > >
>> > > Cheers,
>> > > Gabor
>> > >
>> > >
>> > >
>> >
>>

Re:Re: UNNEST() operation for arrays

Reply via email to