hudi-bot opened a new issue, #17355:
URL: https://github.com/apache/hudi/issues/17355

   The attached archive is the table folder after running the MERGE INTO from the query suite below.
   Reproduced on Spark 3.4; both the Hudi fork for 14.1 and OSS Hudi have this issue.
   Impact: MERGE INTO (MIT) does not work in some cases when a column type is DOUBLE. There is no data corruption.
   I don't think there is an easy alternative except deleting the affected records and re-inserting new ones, especially for cross-partition updates (a sketch of that workaround follows the query suite below).
    
   CREATE TABLE hudi_global_index_mor (
     record_key STRING,
     name STRING,
     age INT,
     department STRING,
     salary DOUBLE,
     ts BIGINT
   ) USING hudi
   PARTITIONED BY (department)
   LOCATION 's3a://onehouse-customer-bucket-fb543790/lakes/observed-default/sql_global_bloom_filter_db_3c97332a_1732579795651/hudi_global_index_mor/'
   TBLPROPERTIES (
     type = 'mor',
     'hoodie.index.type' = 'GLOBAL_SIMPLE',
     'hoodie.index.global.index.enable' = 'true',
     'hoodie.bloom.index.use.metadata' = 'true',
     primaryKey = 'record_key',
     preCombineField = 'ts'
   );

   INSERT INTO hudi_global_index_mor
   SELECT * FROM (
     SELECT 'emp_001' as record_key, 'John Doe' as name, 30 as age, 'Sales' as department, 80000.0 as salary, 1598886000 as ts
     UNION ALL SELECT 'emp_002', 'Jane Smith', 28, 'Sales', 75000.0, 1598886001
     UNION ALL SELECT 'emp_003', 'Bob Wilson', 35, 'Marketing', 85000.0, 1598886002
   );

   UPDATE hudi_global_index_mor
   SET department = 'Sales', salary = 90000.0, ts = 1598886100
   WHERE record_key = 'emp_001';

   SELECT record_key, name, age, department, salary, ts
   FROM hudi_global_index_mor
   ORDER BY record_key;

   CREATE OR REPLACE TEMPORARY VIEW source_updates AS
   SELECT * FROM (
     SELECT 'emp_001' as record_key, 'John Doe' as name, 30 as age, 'Engineering' as department, 95000.0 as salary, cast(1598886200 as BIGINT) as ts
     UNION ALL SELECT 'emp_004', 'Alice Brown', 29, 'Engineering', 82000.0, cast(1598886201 as BIGINT)
   );

   MERGE INTO hudi_global_index_mor t
   USING source_updates s
   ON t.record_key = s.record_key
   WHEN MATCHED THEN UPDATE SET department = s.department, salary = s.salary, ts = s.ts
   WHEN NOT MATCHED THEN INSERT *;
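
   For reference, the delete-then-re-insert workaround mentioned above could look roughly like the following. This is only a minimal sketch, assuming a SparkSession with the Hudi SQL extensions enabled; it reuses the source_updates view from the suite above and hard-codes the matched key (emp_001) purely for illustration. It is not a verified fix.

   {code:java}
   import org.apache.spark.sql.SparkSession;

   // Sketch of the delete-then-re-insert workaround (illustrative only).
   public class DeleteThenReinsertWorkaround {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("hudi-mit-double-workaround")
           // Typical Hudi SQL setup; exact configs depend on the deployment.
           .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
           .config("spark.sql.extensions",
               "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
           .getOrCreate();

       // 1. Delete the current versions of the keys that would have been MATCHED,
       //    including rows that live in a different partition than the incoming data.
       spark.sql("DELETE FROM hudi_global_index_mor WHERE record_key IN ('emp_001')");

       // 2. Re-insert everything from the source view (updated keys and brand-new keys).
       spark.sql("INSERT INTO hudi_global_index_mor SELECT * FROM source_updates");

       spark.stop();
     }
   }
   {code}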
    
    
    
   Error:
   Caused by: org.apache.avro.AvroTypeException: Found double, expecting hoodie.hudi_global_index_mor.hudi_global_index_mor_record.salary.fixed
       at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308)
       at org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
       at org.apache.avro.io.ValidatingDecoder.checkFixed(ValidatingDecoder.java:134)
       at org.apache.avro.io.ValidatingDecoder.readFixed(ValidatingDecoder.java:144)
       at org.apache.avro.generic.GenericDatumReader.readFixed(GenericDatumReader.java:387)
       at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:190)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
       at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
       at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
       at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
       ...
   From the timeline we can clearly see that the column salary is "double":
   
   { "name": "salary", "type": [ "null", "double" ], "default": null }
   
   I also checked the parquet file and didn't see anything abnormal.
   parquet.avro.schema:
   {"type":"record","name":"hudi_global_index_mor_record","namespace":"hoodie.hudi_global_index_mor","fields":[
     {"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},
     {"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},
     {"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},
     {"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},
     {"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},
     {"name":"record_key","type":["null","string"],"default":null},
     {"name":"name","type":["null","string"],"default":null},
     {"name":"age","type":["null","int"],"default":null},
     {"name":"salary","type":["null","double"],"default":null},
     {"name":"ts","type":["null","long"],"default":null},
     {"name":"department","type":["null","string"],"default":null}
   ]}

   writer.model.name: avro
   hoodie_max_record_key: emp_003

   Schema:
   message hoodie.hudi_global_index_mor.hudi_global_index_mor_record {
     optional binary _hoodie_commit_time (STRING);
     optional binary _hoodie_commit_seqno (STRING);
     optional binary _hoodie_record_key (STRING);
     optional binary _hoodie_partition_path (STRING);
     optional binary _hoodie_file_name (STRING);
     optional binary record_key (STRING);
     optional binary name (STRING);
     optional int32 age;
     optional double salary;
     optional int64 ts;
     optional binary department (STRING);
   }

   Row group 0: count: 1   728.00 B records   start: 4   total(compressed): 728 B   total(uncompressed): 540 B
   --------------------------------------------------------------------------------
   column                  type    encodings  count  avg size   nulls  min / max
   _hoodie_commit_time     BINARY  G _        1      68.00 B    0      "20241126001135153" / "20241126001135153"
   _hoodie_commit_seqno    BINARY  G _        1      72.00 B    0      "20241126001135153_1_0" / "20241126001135153_1_0"
   _hoodie_record_key      BINARY  G _        1      58.00 B    0      "emp_003" / "emp_003"
   _hoodie_partition_path  BINARY  G _        1      71.00 B    0      "department=Marketing" / "department=Marketing"
   _hoodie_file_name       BINARY  G _        1      118.00 B   0      "31aa5fb4-9732-4376-bf6c-3..." / "31aa5fb4-9732-4376-bf6c-3..."
   record_key              BINARY  G _        1      58.00 B    0      "emp_003" / "emp_003"
   name                    BINARY  G _        1      61.00 B    0      "Bob Wilson" / "Bob Wilson"
   age                     INT32   G _        1      51.00 B    0      "35" / "35"
   salary                  DOUBLE  G _        1      56.00 B    0      "85000.0" / "85000.0"
   ts                      INT64   G _        1      55.00 B    0      "1598886002" / "1598886002"
   department              BINARY  G _        1      60.00 B    0      "Marketing" / "Marketing"
    
   *One theory is that the code path tries to parse the double value as a DECIMAL, whose underlying Avro type is either bytes or fixed* (a standalone Avro sketch of this mismatch follows the excerpt below). From the Avro specification:

   h3. *Decimal*

   The decimal logical type represents an arbitrary-precision signed decimal number of the form unscaled × 10^(-scale).
   A decimal logical type annotates Avro *bytes* or *fixed* types. The byte array must contain the two's-complement representation of the unscaled integer value in big-endian byte order. The scale is fixed, and is specified using an attribute.
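
   If that theory is right, the same failure shape can be reproduced outside Hudi with plain Avro: encode a record whose writer schema declares salary as double, then decode it with a reader schema that declares salary as a decimal backed by fixed. The sketch below uses made-up schema and field names, so the fixed name in the message will differ from the Hudi one, but it hits the same AvroTypeException ("Found double, expecting ...") through the ResolvingDecoder.

   {code:java}
   import java.io.ByteArrayOutputStream;

   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericDatumReader;
   import org.apache.avro.generic.GenericDatumWriter;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.io.BinaryDecoder;
   import org.apache.avro.io.BinaryEncoder;
   import org.apache.avro.io.DecoderFactory;
   import org.apache.avro.io.EncoderFactory;

   public class DoubleVsFixedDecimalRepro {
     public static void main(String[] args) throws Exception {
       // Writer schema: salary is a plain double (what the log block actually contains).
       Schema writerSchema = new Schema.Parser().parse(
           "{\"type\":\"record\",\"name\":\"rec\",\"fields\":["
               + "{\"name\":\"salary\",\"type\":\"double\"}]}");

       // Reader schema: salary declared as a decimal logical type backed by an Avro fixed.
       Schema readerSchema = new Schema.Parser().parse(
           "{\"type\":\"record\",\"name\":\"rec\",\"fields\":["
               + "{\"name\":\"salary\",\"type\":{\"type\":\"fixed\",\"name\":\"salary_fixed\","
               + "\"size\":8,\"logicalType\":\"decimal\",\"precision\":10,\"scale\":2}}]}");

       // Encode one record with the writer (double) schema.
       GenericRecord rec = new GenericData.Record(writerSchema);
       rec.put("salary", 95000.0d);
       ByteArrayOutputStream out = new ByteArrayOutputStream();
       BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
       new GenericDatumWriter<GenericRecord>(writerSchema).write(rec, encoder);
       encoder.flush();

       // Decode with the mismatched reader schema: the ResolvingDecoder cannot resolve
       // double -> fixed and throws "AvroTypeException: Found double, expecting salary_fixed".
       BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
       new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
     }
   }
   {code}

   So if the reader schema that the log-block reader ends up with describes salary as a fixed-backed decimal while the data block was written with salary as double, this is exactly the error seen in the stack traces above.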
    
    
   Full stack trace:
   Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20241126001204349
       at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:80)
       at org.apache.hudi.table.action.deltacommit.SparkUpsertDeltaCommitActionExecutor.execute(SparkUpsertDeltaCommitActionExecutor.java:47)
       at org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:98)
       at org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:88)
       at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:162)
       at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:221)
       at org.apache.hudi.HoodieSparkSqlWriterInternal.liftedTree1$1(HoodieSparkSqlWriter.scala:484)
       at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:482)
       at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:180)
       at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:118)
       at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.executeUpsert(MergeIntoHoodieTableCommand.scala:421)
       at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:264)
       at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
       at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
       at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
       at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
       at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
       at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
       at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
       at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
       at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
       at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
       at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
       at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
       at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
       at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
       at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
       at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
       at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
       at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
       at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
       at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
       at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
       at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
       at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
       at org.apache.spark.sql.Dataset.<init>(Dataset.scala:218)
       at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:98)
       at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
       at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:95)
       at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:640)
       at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
       at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:630)
       at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:671)
       at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)
       at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:226)
       ... 16 more
   Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4579.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4579.0 (TID 9325) (10.0.91.170 executor 1): org.apache.hudi.exception.HoodieException: Exception when reading log file
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:359)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:218)
       at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:203)
       at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:119)
       at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:77)
       at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:465)
       at org.apache.hudi.io.HoodieMergedReadHandle.getLogRecordScanner(HoodieMergedReadHandle.java:148)
       at org.apache.hudi.io.HoodieMergedReadHandle.getMergedRecords(HoodieMergedReadHandle.java:89)
       at org.apache.hudi.index.HoodieIndexUtils.lambda$getExistingRecords$c7e45d15$1(HoodieIndexUtils.java:246)
       at org.apache.hudi.data.HoodieJavaRDD.lambda$flatMap$a6598fcb$1(HoodieJavaRDD.java:137)
       at org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
       at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
       at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
       at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
       at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
       at org.apache.spark.scheduler.Task.run(Task.scala:139)
       at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
       at java.base/java.lang.Thread.run(Unknown Source)
   Caused by: org.apache.avro.AvroTypeException: Found double, expecting hoodie.hudi_global_index_mor.hudi_global_index_mor_record.salary.fixed
       at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308)
       at org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
       at org.apache.avro.io.ValidatingDecoder.checkFixed(ValidatingDecoder.java:134)
       at org.apache.avro.io.ValidatingDecoder.readFixed(ValidatingDecoder.java:144)
       at org.apache.avro.generic.GenericDatumReader.readFixed(GenericDatumReader.java:387)
       at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:190)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
       at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
       at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
       at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
       at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:223)
       at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:168)
       at org.apache.hudi.common.util.collection.MappingIterator.next(MappingIterator.java:44)
       at org.apache.hudi.common.util.collection.MappingIterator.next(MappingIterator.java:44)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:613)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:654)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:349)
       ... 25 more
   Driver stacktrace:
       at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2790)
       at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2726)
       at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2725)
       at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
       at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
       at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2725)
       at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1211)
       at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1211)
       at scala.Option.foreach(Option.scala:407)
       at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1211)
       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2989)
       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2928)
       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2917)
       at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
       at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:976)
       at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
       at org.apache.spark.SparkContext.runJob(SparkContext.scala:2284)
       at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
       at org.apache.spark.SparkContext.runJob(SparkContext.scala:2328)
       at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1022)
       at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
       at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
       at org.apache.spark.rdd.RDD.withScope(RDD.scala:408)
       at org.apache.spark.rdd.RDD.collect(RDD.scala:1021)
       at org.apache.spark.rdd.PairRDDFunctions.$anonfun$countByKey$1(PairRDDFunctions.scala:367)
       at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
       at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
       at org.apache.spark.rdd.RDD.withScope(RDD.scala:408)
       at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:367)
       at org.apache.spark.api.java.JavaPairRDD.countByKey(JavaPairRDD.scala:314)
       at org.apache.hudi.data.HoodieJavaPairRDD.countByKey(HoodieJavaPairRDD.java:105)
       at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.buildProfile(BaseSparkCommitActionExecutor.java:220)
       at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:186)
       at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:90)
       at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:74)
       ... 60 more
   Caused by: org.apache.hudi.exception.HoodieException: Exception when reading log file
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:359)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:218)
       at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:203)
       at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:119)
       at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:77)
       at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:465)
       at org.apache.hudi.io.HoodieMergedReadHandle.getLogRecordScanner(HoodieMergedReadHandle.java:148)
       at org.apache.hudi.io.HoodieMergedReadHandle.getMergedRecords(HoodieMergedReadHandle.java:89)
       at org.apache.hudi.index.HoodieIndexUtils.lambda$getExistingRecords$c7e45d15$1(HoodieIndexUtils.java:246)
       at org.apache.hudi.data.HoodieJavaRDD.lambda$flatMap$a6598fcb$1(HoodieJavaRDD.java:137)
       at org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
       at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
       at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
       at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
       at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
       at org.apache.spark.scheduler.Task.run(Task.scala:139)
       at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
       ... 3 more
   Caused by: org.apache.avro.AvroTypeException: Found double, expecting hoodie.hudi_global_index_mor.hudi_global_index_mor_record.salary.fixed
       at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308)
       at org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
       at org.apache.avro.io.ValidatingDecoder.checkFixed(ValidatingDecoder.java:134)
       at org.apache.avro.io.ValidatingDecoder.readFixed(ValidatingDecoder.java:144)
       at org.apache.avro.generic.GenericDatumReader.readFixed(GenericDatumReader.java:387)
       at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:190)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
       at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
       at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
       at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
       at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
       at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:223)
       at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:168)
       at org.apache.hudi.common.util.collection.MappingIterator.next(MappingIterator.java:44)
       at org.apache.hudi.common.util.collection.MappingIterator.next(MappingIterator.java:44)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:613)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:654)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:349)
       ... 25 more
       at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:385)
       at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
       at com.zaxxer.hikari.pool.ProxyStatement.execute(ProxyStatement.java:94)
       at com.zaxxer.hikari.pool.HikariProxyStatement.execute(HikariProxyStatement.java)
       at integrationTests.sql.HiveConnection.runQueryAndFetchResultsInCsv(HiveConnection.java:102)
       ... 154 more
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-8825
   - Type: Sub-task
   - Parent: https://issues.apache.org/jira/browse/HUDI-9109
   - Fix version(s):
     - 1.1.0
   - Attachment(s):
     - 05/Jan/25 23:29;daviszhang;itFailure.zip;https://issues.apache.org/jira/secure/attachment/13073784/itFailure.zip
   
   
   ---
   
   
   ## Comments
   
   30/Apr/25 18:08;shivnarayan;I tried to reproduce the issue
   
    
   
   
   CREATE TABLE hudi_global_index_mor 
   ( record_key STRING, 
   name STRING, 
   age INT, 
   department STRING, 
   salary DOUBLE, 
   ts BIGINT ) 
   USING hudi 
   PARTITIONED BY (department) 
   LOCATION 'file:///tmp/hudi-mor-mit' 
   TBLPROPERTIES ( type = 'mor', 'hoodie.index.type' = 'GLOBAL_SIMPLE', 
'hoodie.index.global.index.enable' = 'true', primaryKey = 'record_key', 
preCombineField = 'ts' ) ; 
   
   INSERT INTO hudi_global_index_mor 
   SELECT * FROM ( SELECT 'emp_001' as record_key, 'John Doe' as name, 30 as 
age, 'Sales' as department, 80000.0 as salary, 1598886000 as ts UNION ALL 
SELECT 'emp_002', 'Jane Smith', 28, 'Sales', 75000.0, 1598886001 UNION ALL 
SELECT 'emp_003', 'Bob Wilson', 35, 'Marketing', 85000.0, 1598886002 ) ;
   
   
   UPDATE hudi_global_index_mor 
   SET department = 'Sales', salary = 90000.0, ts = 1598886100 WHERE record_key 
= 'emp_001' ; 
   
    
   
   -> The UPDATE command fails because we can't assign a partition column. So that needs to be fixed.
   {code:java}
   UPDATE hudi_global_index_mor 
                      > SET department = 'Sales', salary = 90000.0, ts = 
1598886100 WHERE record_key = 'emp_001' ; 
   Detected disallowed assignment clause in UPDATE statement for partition 
field `department` for table `spark_catalog.default.hudi_global_index_mor`. 
Please remove the assignment clause to avoid the error.
   spark-sql (default)>  {code}
   So I had to retry without that column assignment:
   
   UPDATE hudi_global_index_mor SET salary = 90000.0, ts = 1598886100 WHERE 
record_key = 'emp_001' ;
   
    
   {code:java}
   select * from hudi_global_index_mor order by record_key ;
   25/04/30 11:01:17 WARN PositionBasedFileGroupRecordBuffer: Falling back to 
key based merge for Read
   25/04/30 11:01:17 WARN PositionBasedFileGroupRecordBuffer: Falling back to 
key based merge for Read
   20250430110100330    20250430110100330_0_1   emp_001 department=Sales        
2b49c5b2-1d99-40e2-b215-e5ef936ce9ed-0  emp_001 John Doe        30      90000.0 
1598886100      Sales
   20250430105611180    20250430105611180_0_1   emp_002 department=Sales        
2b49c5b2-1d99-40e2-b215-e5ef936ce9ed-0_0-36-339_20250430105611180.parquet       
emp_002 Jane Smith      28      75000.0 1598886001      Sales
   20250430105611180    20250430105611180_1_0   emp_003 department=Marketing    
06783ae0-b1dd-40a4-be55-ee98e8f72e03-0_1-36-340_20250430105611180.parquet       
emp_003 Bob Wilson      35      85000.0 1598886002      Marketing
   Time taken: 0.431 seconds, Fetched 3 row(s) {code}
   
   CREATE OR REPLACE TEMPORARY VIEW source_updates AS SELECT * FROM ( SELECT 
'emp_001' as record_key, 'John Doe' as name, 30 as age, 'Engineering' as 
department, 95000.0 as salary, cast(1598886200 as BIGINT) as ts UNION ALL 
SELECT 'emp_004', 'Alice Brown', 29, 'Engineering', 82000.0, cast(1598886201 as 
BIGINT) ) ; 
   
   select * from source_updates;
   
   emp_001 John Doe 30 Engineering 95000.0 1598886200
   
   emp_004 Alice Brown 29 Engineering 82000.0 1598886201
   
   Time taken: 0.119 seconds, Fetched 2 row(s)
   
    
   
   MERGE INTO (MIT) expects the record key to be assigned as a mandatory field; the MERGE statement in the Jira description above did not include one.
   {code:java}
   MERGE INTO hudi_global_index_mor t 
                      > USING source_updates s 
                      > ON t.record_key = s.record_key 
                      > WHEN MATCHED THEN UPDATE SET department = s.department, 
salary = s.salary, ts = s.ts 
                      > WHEN NOT MATCHED THEN INSERT * ;
   MERGE INTO field resolution error: No matching assignment found for target 
table record key field `record_key`
   spark-sql (default)>  {code}
   So I retried with the record key assignment included:
   
   
   MERGE INTO hudi_global_index_mor t 
   USING source_updates s 
   ON t.record_key = s.record_key 
   WHEN MATCHED THEN UPDATE SET department = s.department, salary = s.salary, 
record_key = s.record_key, ts = s.ts 
   WHEN NOT MATCHED THEN INSERT * ;
   
    
   
   Command succeeded. 
   {code:java}
   MERGE INTO hudi_global_index_mor t 
                      > USING source_updates s 
                      > ON t.record_key = s.record_key 
                      > WHEN MATCHED THEN UPDATE SET department = s.department, 
salary = s.salary, record_key = s.record_key, ts = s.ts 
                      > WHEN NOT MATCHED THEN INSERT * ;
   25/04/30 11:03:32 WARN HoodieTableFileSystemView: Partition: 
department=Engineering is not available in store
   25/04/30 11:03:32 WARN HoodieLogBlock: There are records without valid 
positions. Skip writing record positions to the block header.
   25/04/30 11:03:33 WARN log: Updating partition stats fast for: 
hudi_global_index_mor_ro
   25/04/30 11:03:33 WARN log: Updated size to 435953
   25/04/30 11:03:33 WARN log: Updating partition stats fast for: 
hudi_global_index_mor_rt
   25/04/30 11:03:33 WARN log: Updated size to 435953
   25/04/30 11:03:34 WARN log: Updating partition stats fast for: 
hudi_global_index_mor
   25/04/30 11:03:34 WARN log: Updated size to 435953
   25/04/30 11:03:34 WARN HiveConf: HiveConf of name 
hive.internal.ss.authz.settings.applied.marker does not exist
   25/04/30 11:03:34 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout 
does not exist
   25/04/30 11:03:34 WARN HiveConf: HiveConf of name hive.stats.retries.wait 
does not exist
   Time taken: 3.65 seconds
   spark-sql (default)>  {code}
   Querying the table to confirm:
   {code:java}
     > select * from hudi_global_index_mor order by record_key ;
   25/04/30 11:03:46 WARN PositionBasedFileGroupRecordBuffer: Falling back to 
key based merge for Read
   25/04/30 11:03:46 WARN PositionBasedFileGroupRecordBuffer: Falling back to 
key based merge for Read
   20250430110331085    20250430110331085_1_0   emp_001 department=Engineering  
4ca675cc-7478-46a2-a555-d552638a380a-0_1-94-740_20250430110331085.parquet       
emp_001 NULL    NULL    95000.0 1598886200      Engineering
   20250430105611180    20250430105611180_0_1   emp_002 department=Sales        
2b49c5b2-1d99-40e2-b215-e5ef936ce9ed-0_0-36-339_20250430105611180.parquet       
emp_002 Jane Smith      28      75000.0 1598886001      Sales
   20250430105611180    20250430105611180_1_0   emp_003 department=Marketing    
06783ae0-b1dd-40a4-be55-ee98e8f72e03-0_1-36-340_20250430105611180.parquet       
emp_003 Bob Wilson      35      85000.0 1598886002      Marketing
   20250430110331085    20250430110331085_1_1   emp_004 department=Engineering  
4ca675cc-7478-46a2-a555-d552638a380a-0_1-94-740_20250430110331085.parquet       
emp_004 Alice Brown     29      82000.0 1598886201      Engineering
   Time taken: 0.317 seconds, Fetched 4 row(s) {code}
    
   
   So I could not reproduce the issue that the user is reporting.

   The only difference is that I am using the latest master.
   
    
   
    ;;;

