dht7 opened a new issue, #9042:
URL: https://github.com/apache/hudi/issues/9042
**Describe the problem you faced**
Apache Hudi tables created via a CTAS statement in Spark SQL with array column
types fail on subsequent insert overwrite queries.
**To Reproduce**
Steps to reproduce the behavior:
1. Run the following create statement via Spark-SQL (or Spark-Thrift server):
```
create table test_database.test_table
using hudi
options (type="cow", primaryKey='id', hoodie.table.name="test_table")
location '<GCS bucket location>'
as select id, array_column1 from test_database.source_table;
```
Here, source_table is also a Hudi table.
2. Once the table is created, try to update the data by running the
following query:
```
insert overwrite table test_database.test_table
select id, array_column1 from test_database.source_table;
```
Running the above query results in an org.apache.spark.sql.AnalysisException
(complete stack trace attached below).
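For context, the exception is raised by Spark's TableOutputResolver (visible in the stack trace below), which rejects a write when the incoming query's array allows null elements but the target column's array type does not. The following is a minimal stdlib-only model of that nullability rule, for illustration only; it is not Spark's actual code, and the dict-based column representation is an assumption made for the sketch:

```python
# Minimal model of Spark's array element-nullability write check
# (illustrative only, not the real TableOutputResolver implementation).
# A column's array type is modeled as {"type": "array", "contains_null": bool}.

def can_write(source_col, target_col):
    """Writing succeeds only if the target permits nulls wherever the source may produce them."""
    if source_col["contains_null"] and not target_col["contains_null"]:
        # Corresponds to: "Cannot write nullable elements to array of non-nulls"
        return False
    return True

# Hypothesis matching the observed behavior: the CTAS-created table's catalog
# schema marks array elements as non-nullable, while the source query
# produces nullable elements, so the write is rejected.
ctas_target = {"type": "array", "contains_null": False}
query_source = {"type": "array", "contains_null": True}
print(can_write(query_source, ctas_target))      # False -> AnalysisException

# An explicit CREATE TABLE with ARRAY<INT> defaults to nullable elements
# in Spark, so the same write would be accepted.
explicit_target = {"type": "array", "contains_null": True}
print(can_write(query_source, explicit_target))  # True
```

This would also be consistent with the observation further below that tables created with an explicit schema do not hit the error.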
**Expected behavior**
The insert overwrite query should execute against the target table without any
errors or exceptions.
**Environment Description**
* Hudi version : 0.12.2
* Spark version : 3.1.2
* Storage (HDFS/S3/GCS..) : GCS
* Running on Docker? (yes/no) : no
**Additional context**
* If the table is created with a plain create statement (explicit schema)
instead of CTAS, the insert overwrite query succeeds:
```
create table test_database.test_table (id INT, array_column1 ARRAY<INT>)
using hudi
options (type="cow", primaryKey='id', hoodie.table.name="test_table")
location '<GCS bucket location>';
```
* Registered Avro schema of test_table:
```
{
  "type" : "record",
  "name" : "test_table_record",
  "namespace" : "hoodie.test_table",
  "fields" : [ {
    "name" : "id",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "array_column1",
    "type" : [ "null", {
      "type" : "array",
      "items" : [ "int", "null" ]
    } ],
    "default" : null
  } ]
}
```
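Note that in the registered Avro schema above, the array's items type is the union [ "int", "null" ], i.e. the elements are nullable on the Avro side, which suggests the mismatch lies in the Spark catalog schema that CTAS produced. A small stdlib-only sketch that checks this property of the schema (the helper function is written for this issue, not part of any Hudi or Avro API):

```python
import json

# Avro schema of test_table as registered by Hudi (copied from the issue).
AVRO_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "test_table_record",
  "namespace": "hoodie.test_table",
  "fields": [
    {"name": "id", "type": ["null", "int"], "default": null},
    {"name": "array_column1",
     "type": ["null", {"type": "array", "items": ["int", "null"]}],
     "default": null}
  ]
}
""")

def array_elements_nullable(schema, field_name):
    """True if the named array field's Avro item type is a union containing "null"."""
    field = next(f for f in schema["fields"] if f["name"] == field_name)
    # The field's type is the union ["null", {...array...}]; pick the array branch.
    array_type = next(t for t in field["type"]
                      if isinstance(t, dict) and t["type"] == "array")
    items = array_type["items"]
    return isinstance(items, list) and "null" in items

print(array_elements_nullable(AVRO_SCHEMA, "array_column1"))  # True
```

So the Avro schema permits null array elements, yet the write fails with "Cannot write nullable elements to array of non-nulls", pointing at an inconsistency between the Avro schema and the table schema recorded at CTAS time.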
**Stacktrace**
```
Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Cannot write incompatible data to table 'test_database.test_table':
- Cannot write nullable elements to array of non-nulls: 'array_column1'
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:362)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:264)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
    at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:264)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:273)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.sql.AnalysisException: Cannot write incompatible data to table 'test_database.test_table':
- Cannot write nullable elements to array of non-nulls: 'array_column1'
    at org.apache.spark.sql.catalyst.analysis.TableOutputResolver$.resolveOutputColumns(TableOutputResolver.scala:73)
    at org.apache.spark.sql.HoodieSpark3CatalystPlanUtils.resolveOutputColumns(HoodieSpark3CatalystPlanUtils.scala:45)
    at org.apache.spark.sql.HoodieSpark3CatalystPlanUtils.resolveOutputColumns$(HoodieSpark3CatalystPlanUtils.scala:40)
    at org.apache.spark.sql.HoodieSpark31CatalystPlanUtils$.resolveOutputColumns(HoodieSpark31CatalystPlanUtils.scala:25)
    at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.coerceQueryOutputColumns(InsertIntoHoodieTableCommand.scala:164)
    at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.alignQueryOutput(InsertIntoHoodieTableCommand.scala:145)
    at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:99)
    at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:60)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
    at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3700)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3698)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:326)
    ... 16 more (state=,code=0)
```