Actually, the new type inference stack for UDFs is smart enough to solve this issue. It could derive a data type for the array from the surrounding call (expected data type).

So this can be supported with the right type inference logic: `cast(ARRAY() as int)`

Unfortunately, ARRAY is fully managed by Calcite and maybe also deeply integrated into the parser (at least this is the case for ROW). TBH, if I were to design a FLIP for the collection functions, I would actually propose introducing `ARRAY_OF()` and `ROW_OF()` to have full control over the type inference in our stack. In our stack, this also means that NULL is unknown; Calcite distinguishes between NULL and unknown.

So if we want to go the easy path (without introducing ARRAY_OF), ARRAY() should result in ARRAY<NULL> if the type cannot be derived from the surrounding call.
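
For illustration, a rough sketch of that behavior (not what Flink does today; `array_int_column` is a hypothetical ARRAY<INT> column):
`
-- element type derived from the expected data type of the surrounding call
SELECT COALESCE(array_int_column, ARRAY());  -- ARRAY() inferred as ARRAY<INT>

-- no surrounding type information available: fall back to ARRAY<NULL>
SELECT ARRAY();
`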

Regards,
Timo

On 28.10.22 03:46, yuxia wrote:
For an empty array, it seems different engines use different data types:
Hive: string
Spark: string ?
Trino:  Unknown
BigQuery: Integer

I have tried with Hive and Spark, but haven't tried with Trino and BigQuery.

I'm a little in doubt about Spark's behavior. From my side, it seems Spark
actually uses the string type, which is different from your investigation.
I tried the following SQL in spark-cli:
`
select array() + 1
`

The exception is
`
Error in query: cannot resolve '(array() + 1)' due to data type mismatch: differing 
types in '(array() + 1)' (array<string> and int).; line 1 pos 7;
'Project [unresolvedalias((array() + 1), None)]
+- OneRowRelation
`


It seems hard to decide which data type Flink should use. I'm interested in 
the reason why you would like to use the Integer type.
I haven't checked whether the SQL standard specifies it. But from my side, I 
prefer to follow Hive/Spark.

BTW: the query `SELECT COALESCE(1, cast(ARRAY() as int))` will fail in Hive and 
Spark.


Best regards,
Yuxia

----- Original Message -----
From: "eric xiao" <xiao.eric...@gmail.com>
To: "dev" <dev@flink.apache.org>
Sent: Thursday, October 27, 2022, 9:13:51 PM
Subject: [DISCUSS] FLINK-20578 Cannot create empty array using ARRAY[]

Hi,

I would like to propose a solution to this JIRA issue. I looked at the
comments and there was some guidance around where in the code we should make
changes to allow this behaviour. But I believe there are still two
questions that remain open:

    1. Is this expected behaviour (i.e. should users be able to create an
    empty array)?
    2. If this is indeed expected behaviour, what should the data type of
    the empty array be?

I did some digging into other query engines / databases in hopes of
answering the following two questions; my findings can be found at the end of
this email.

*Q: *Is this expected behaviour (i.e. should users be able to create an
empty array)?
*A: *Yes, I would say this is expected behaviour and something we should add
to the Flink SQL API.

*Q: *What should the data type of the empty array be?
*A: *This question is a bit harder to answer, and I think it would require
two steps.

*Step 1: Pick a default data type to initialize the empty array.*

We can use an "empty data type" such as NULL or VOID.
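
For illustration only (a sketch of the proposed default, not current Flink behaviour):
`
-- the empty array literal would be typed with the chosen "empty" element type
SELECT ARRAY[];   -- e.g. ARRAY<NULL> (or ARRAY<VOID>)
`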

*Step 2: Create or reuse type coercion to make using empty arrays easier.*

The above should unblock users from creating empty arrays, but a problem
arises if one uses an empty array in a COALESCE operation,

i.e. SELECT COALESCE(int_column, ARRAY[])

I believe they will get a query error because the type of int_column (INTEGER)
and the type of the empty array (NULL, VOID) do not match. Thus a user will
need to cast the empty array,

i.e. SELECT COALESCE(int_column, CAST(ARRAY[] AS INT))

for the COALESCE query to execute successfully.
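
As a sketch of what the coercion in step 2 would buy us (assuming a hypothetical ARRAY<INT> column named array_column):
`
-- without coercion: an explicit cast is needed to align the types
SELECT COALESCE(array_column, CAST(ARRAY[] AS ARRAY<INT>));

-- with coercion: the empty array is widened to the type of the other argument
SELECT COALESCE(array_column, ARRAY[]);
`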

-----
*Trino*
EXPLAIN SELECT ARRAY[]
Fragment 0 [SINGLE]
     Output layout: [expr]
     Output partitioning: SINGLE []
     Output[columnNames = [_col0]]
     │   Layout: [expr:array(unknown)]
     │   Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}
     │   _col0 := expr
     └─ Values[]
            Layout: [expr:array(unknown)]
            Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}

  ("$literal$"(from_base64('AwAAAFJMRQAAAAAKAAAAQllURV9BUlJBWQEAAAABgAAAAAA=')))

Expected behaviour? *Yes.*
Array data type? *Unknown.*

*Spark*
sc.sql.sql("SELECT ARRAY[]").explain()
  DataFrame[array(): array<void>]

Expected behaviour? *Yes.*
Array data type? *Void.*

*BigQuery*
SELECT ARRAY[]
Field name    Type       Mode
f0_           INTEGER    REPEATED

Expected behaviour? *Yes.*
Array data type? *Integer.*

Best,

Eric

