Vinoth Govindarajan created HUDI-3748:
-----------------------------------------

             Summary: Hudi fails to insert into a partitioned table when the 
partition column is dropped from the parquet schema
                 Key: HUDI-3748
                 URL: https://issues.apache.org/jira/browse/HUDI-3748
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Vinoth Govindarajan


When you set {{hoodie.datasource.write.drop.partition.columns = 'true'}} to drop the partition column from the Parquet schema (which is needed for BigQuery support), Hudi fails to insert into the table with the error below.
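For context, outside of Spark SQL this config is typically set on the DataFrame writer when preparing a Hudi table for BigQuery. The snippet below is only an illustrative sketch (the output path and sample row are placeholders, not taken from this report; the other option keys are standard Hudi write options):
{code:scala}
// Hypothetical sketch: setting the drop-partition-columns config through the
// DataFrame API (assumes a spark-shell session where `spark` is in scope).
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Placeholder row mirroring the table schema used in the repro below.
val df = Seq((1L, "a1", 10.0, 1640390400L, "2021-12-25"))
  .toDF("id", "name", "price", "ts", "dt")

df.write.format("hudi")
  .option("hoodie.table.name", "bq_demo_partitioned_cow")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  // The config under test: drop the partition column from the parquet files.
  .option("hoodie.datasource.write.drop.partition.columns", "true")
  .mode(SaveMode.Append)
  .save("/tmp/bq_demo_partitioned_cow")  // placeholder path
{code}
The failure reported here is specific to the Spark SQL INSERT path, so the SQL steps below are the actual reproduction.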

Steps to reproduce:

Log into spark-sql and execute the following commands:


{code:sql}
create table bq_demo_partitioned_cow (
  id bigint,
  name string,
  price double,
  ts bigint,
  dt string
) using hudi
partitioned by (dt)
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.datasource.write.drop.partition.columns = 'true'
);

insert into bq_demo_partitioned_cow partition(dt='2021-12-25') values(1, 'a1', 10, current_timestamp());
insert into bq_demo_partitioned_cow partition(dt='2021-12-25') values(2, 'a2', 20, current_timestamp());
{code}
Error:
{code:java}
22/03/29 20:58:02 INFO spark.SparkContext: Starting job: collect at HoodieSparkEngineContext.java:100
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Got job 63 (collect at HoodieSparkEngineContext.java:100) with 1 output partitions
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Final stage: ResultStage 131 (collect at HoodieSparkEngineContext.java:100)
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Parents of final stage: List()
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Missing parents: List()
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Submitting ResultStage 131 (MapPartitionsRDD[235] at map at HoodieSparkEngineContext.java:100), which has no missing parents
22/03/29 20:58:02 INFO memory.MemoryStore: Block broadcast_89 stored as values in memory (estimated size 71.9 KB, free 364.0 MB)
22/03/29 20:58:02 INFO memory.MemoryStore: Block broadcast_89_piece0 stored as bytes in memory (estimated size 26.3 KB, free 364.0 MB)
22/03/29 20:58:02 INFO storage.BlockManagerInfo: Added broadcast_89_piece0 in memory on adhoc-1:38703 (size: 26.3 KB, free: 365.7 MB)
22/03/29 20:58:02 INFO spark.SparkContext: Created broadcast 89 from broadcast at DAGScheduler.scala:1161
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 131 (MapPartitionsRDD[235] at map at HoodieSparkEngineContext.java:100) (first 15 tasks are for partitions Vector(0))
22/03/29 20:58:02 INFO scheduler.TaskSchedulerImpl: Adding task set 131.0 with 1 tasks
22/03/29 20:58:02 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 131.0 (TID 4081, localhost, executor driver, partition 0, PROCESS_LOCAL, 7803 bytes)
22/03/29 20:58:02 INFO executor.Executor: Running task 0.0 in stage 131.0 (TID 4081)
22/03/29 20:58:02 INFO executor.Executor: Finished task 0.0 in stage 131.0 (TID 4081). 1167 bytes result sent to driver
22/03/29 20:58:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 131.0 (TID 4081) in 17 ms on localhost (executor driver) (1/1)
22/03/29 20:58:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 131.0, whose tasks have all completed, from pool
22/03/29 20:58:02 INFO scheduler.DAGScheduler: ResultStage 131 (collect at HoodieSparkEngineContext.java:100) finished in 0.030 s
22/03/29 20:58:02 INFO scheduler.DAGScheduler: Job 63 finished: collect at HoodieSparkEngineContext.java:100, took 0.032364 s
22/03/29 20:58:02 INFO timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20220329205734338__commit__COMPLETED]}
22/03/29 20:58:02 INFO view.AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
22/03/29 20:58:02 INFO util.ClusteringUtils: Found 0 files in pending clustering operations
22/03/29 20:58:02 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=1, NumFileGroups=1, FileGroupsCreationTime=0, StoreTimeTaken=0
22/03/29 20:58:02 INFO hudi.HoodieFileIndex: Refresh table bq_demo_partitioned_cow, spend: 124 ms
22/03/29 20:58:02 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://namenode:8020/user/hive/warehouse/bq_demo_partitioned_cow
22/03/29 20:58:02 INFO table.HoodieTableConfig: Loading table properties from hdfs://namenode:8020/user/hive/warehouse/bq_demo_partitioned_cow/.hoodie/hoodie.properties
22/03/29 20:58:02 INFO table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://namenode:8020/user/hive/warehouse/bq_demo_partitioned_cow
22/03/29 20:58:02 INFO timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20220329205734338__commit__COMPLETED]}
22/03/29 20:58:02 INFO command.InsertIntoHoodieTableCommand: insert statement use write operation type: upsert, payloadClass: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
22/03/29 20:58:02 ERROR thriftserver.SparkSQLDriver: Failed in [insert into bq_demo_partitioned_cow partition(dt='2021-12-25') values(2, 'a2', 20, current_timestamp())]
java.lang.AssertionError: assertion failed: Required partition columns is: {"type":"struct","fields":[]}, Current static partitions is: dt -> 2021-12-25
    at scala.Predef$.assert(Predef.scala:170)
    at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.alignOutputFields(InsertIntoHoodieTableCommand.scala:130)
    at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:94)
    at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:55)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
    at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
    at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
spark-sql>
22/03/29 21:00:38 INFO hfile.LruBlockCache: totalSize=383.75 KB, freeSize=363.83 MB, max=364.20 MB, blockCount=0, accesses=4, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=59, evicted=0, evictedPerRun=0.0
{code}
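Per the trace, the assertion trips in {{InsertIntoHoodieTableCommand.alignOutputFields}} (InsertIntoHoodieTableCommand.scala:130). The snippet below is only a sketch reconstructed from the error message, not the actual Hudi source: with the drop-partition-columns config enabled, the partition schema derived for the table comes back empty, while the INSERT still carries the static partition {{dt='2021-12-25'}}, so the two sides of the check can no longer line up.
{code:scala}
// Illustrative sketch of the failing check (assumed shape, not Hudi source).
import org.apache.spark.sql.types.StructType

// With hoodie.datasource.write.drop.partition.columns = 'true', the partition
// column is missing from the table schema, so the required partition schema
// resolves to an empty struct:
val requiredPartitionSchema = StructType(Nil)   // {"type":"struct","fields":[]}

// ...but the INSERT statement still specifies a static partition:
val staticPartitions = Map("dt" -> "2021-12-25")

// The counts no longer match, so an assertion like this one fails with the
// exact message seen in the log above:
assert(staticPartitions.size == requiredPartitionSchema.fields.length,
  s"Required partition columns is: ${requiredPartitionSchema.json}, " +
  s"Current static partitions is: ${staticPartitions.map { case (k, v) => s"$k -> $v" }.mkString(",")}")
{code}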



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
