hudi-bot opened a new issue, #17355:
URL: https://github.com/apache/hudi/issues/17355
The attachment is the table folder after running the MERGE INTO from the query suite below.
Spark 3.4; both the Hudi fork for 14.1 and OSS Hudi have this issue.
Impact: MERGE INTO (MIT) does not work in some cases when a column type is double. No data corruption.
I don't think there is an easy alternative except deleting the affected records and re-inserting the new ones, especially for cross-partition updates.
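A minimal sketch of that delete-then-reinsert workaround, driven through spark.sql(); the statements and class name are illustrative assumptions, not taken from this report, and they reuse the hudi_global_index_mor table and source_updates view defined in the query suite below. It assumes a SparkSession already configured for Hudi (SQL extensions, catalog, etc.).

```java
import org.apache.spark.sql.SparkSession;

public class DeleteThenReinsertWorkaround {
  public static void main(String[] args) {
    // Assumes Hudi SQL extensions/catalog are already configured on the session.
    SparkSession spark = SparkSession.builder()
        .appName("hudi-merge-into-workaround")
        .getOrCreate();

    // Delete the key that MERGE INTO would have matched in this repro (emp_001).
    // In general, every matched key would be deleted first; the global index
    // routes the delete to the old partition, covering cross-partition updates.
    spark.sql("DELETE FROM hudi_global_index_mor WHERE record_key = 'emp_001'");

    // Re-insert the new versions of the matched rows plus the brand-new keys.
    spark.sql("INSERT INTO hudi_global_index_mor "
        + "SELECT record_key, name, age, department, salary, ts FROM source_updates");

    spark.stop();
  }
}
```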
CREATE TABLE hudi_global_index_mor (
  record_key STRING,
  name STRING,
  age INT,
  department STRING,
  salary DOUBLE,
  ts BIGINT
) USING hudi
PARTITIONED BY (department)
LOCATION 's3a://onehouse-customer-bucket-fb543790/lakes/observed-default/sql_global_bloom_filter_db_3c97332a_1732579795651/hudi_global_index_mor/'
TBLPROPERTIES (
  type = 'mor',
  'hoodie.index.type' = 'GLOBAL_SIMPLE',
  'hoodie.index.global.index.enable' = 'true',
  'hoodie.bloom.index.use.metadata' = 'true',
  primaryKey = 'record_key',
  preCombineField = 'ts'
);

INSERT INTO hudi_global_index_mor
SELECT * FROM (
  SELECT 'emp_001' as record_key, 'John Doe' as name, 30 as age, 'Sales' as department, 80000.0 as salary, 1598886000 as ts
  UNION ALL SELECT 'emp_002', 'Jane Smith', 28, 'Sales', 75000.0, 1598886001
  UNION ALL SELECT 'emp_003', 'Bob Wilson', 35, 'Marketing', 85000.0, 1598886002
);

UPDATE hudi_global_index_mor
SET department = 'Sales', salary = 90000.0, ts = 1598886100
WHERE record_key = 'emp_001';

SELECT record_key, name, age, department, salary, ts
FROM hudi_global_index_mor
ORDER BY record_key;

CREATE OR REPLACE TEMPORARY VIEW source_updates AS
SELECT * FROM (
  SELECT 'emp_001' as record_key, 'John Doe' as name, 30 as age, 'Engineering' as department, 95000.0 as salary, cast(1598886200 as BIGINT) as ts
  UNION ALL SELECT 'emp_004', 'Alice Brown', 29, 'Engineering', 82000.0, cast(1598886201 as BIGINT)
);

MERGE INTO hudi_global_index_mor t
USING source_updates s
ON t.record_key = s.record_key
WHEN MATCHED THEN UPDATE SET department = s.department, salary = s.salary, ts = s.ts
WHEN NOT MATCHED THEN INSERT *;
Error

Caused by: org.apache.avro.AvroTypeException: Found double, expecting hoodie.hudi_global_index_mor.hudi_global_index_mor_record.salary.fixed
  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308)
  at org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
  at org.apache.avro.io.ValidatingDecoder.checkFixed(ValidatingDecoder.java:134)
  at org.apache.avro.io.ValidatingDecoder.readFixed(ValidatingDecoder.java:144)
  at org.apache.avro.generic.GenericDatumReader.readFixed(GenericDatumReader.java:387)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:190)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
  at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
  at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
From the timeline we can clearly see the column salary is "double":
{ "name": "salary", "type": [ "null", "double" ], "default": null }
I also checked the parquet file and didn't see anything abnormal.
parquet.avro.schema:
{"type":"record","name":"hudi_global_index_mor_record","namespace":"hoodie.hudi_global_index_mor","fields":[
  {"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},
  {"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},
  {"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},
  {"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},
  {"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},
  {"name":"record_key","type":["null","string"],"default":null},
  {"name":"name","type":["null","string"],"default":null},
  {"name":"age","type":["null","int"],"default":null},
  {"name":"salary","type":["null","double"],"default":null},
  {"name":"ts","type":["null","long"],"default":null},
  {"name":"department","type":["null","string"],"default":null}
]}

writer.model.name: avro
hoodie_max_record_key: emp_003

Schema:
message hoodie.hudi_global_index_mor.hudi_global_index_mor_record {
  optional binary _hoodie_commit_time (STRING);
  optional binary _hoodie_commit_seqno (STRING);
  optional binary _hoodie_record_key (STRING);
  optional binary _hoodie_partition_path (STRING);
  optional binary _hoodie_file_name (STRING);
  optional binary record_key (STRING);
  optional binary name (STRING);
  optional int32 age;
  optional double salary;
  optional int64 ts;
  optional binary department (STRING);
}

Row group 0: count: 1, 728.00 B records, start: 4, total(compressed): 728 B, total(uncompressed): 540 B
--------------------------------------------------------------------------------
column                  type    encodings  count  avg size   nulls  min / max
_hoodie_commit_time     BINARY  G _        1      68.00 B    0      "20241126001135153" / "20241126001135153"
_hoodie_commit_seqno    BINARY  G _        1      72.00 B    0      "20241126001135153_1_0" / "20241126001135153_1_0"
_hoodie_record_key      BINARY  G _        1      58.00 B    0      "emp_003" / "emp_003"
_hoodie_partition_path  BINARY  G _        1      71.00 B    0      "department=Marketing" / "department=Marketing"
_hoodie_file_name       BINARY  G _        1      118.00 B   0      "31aa5fb4-9732-4376-bf6c-3..." / "31aa5fb4-9732-4376-bf6c-3..."
record_key              BINARY  G _        1      58.00 B    0      "emp_003" / "emp_003"
name                    BINARY  G _        1      61.00 B    0      "Bob Wilson" / "Bob Wilson"
age                     INT32   G _        1      51.00 B    0      "35" / "35"
salary                  DOUBLE  G _        1      56.00 B    0      "85000.0" / "85000.0"
ts                      INT64   G _        1      55.00 B    0      "1598886002" / "1598886002"
department              BINARY  G _        1      60.00 B    0      "Marketing" / "Marketing"
One theory is that the code path tries to parse the double value as a DECIMAL, whose underlying Avro type is either bytes or fixed. From the Avro specification on the decimal logical type:

Decimal
The decimal logical type represents an arbitrary-precision signed decimal number of the form unscaled × 10^-scale.
A decimal logical type annotates Avro bytes or fixed types. The byte array must contain the two's-complement representation of the unscaled integer value in big-endian byte order. The scale is fixed, and is specified using an attribute.
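To make that theory concrete, here is a minimal standalone Avro sketch (not Hudi's actual code path; the record name, fixed name, and decimal precision/scale are invented) showing that plain Avro schema resolution raises the same kind of AvroTypeException when data written with salary as a double is read back with a reader schema where salary is a fixed-backed decimal:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class DoubleVsFixedDecimal {
  public static void main(String[] args) throws Exception {
    // Writer schema: salary is a plain double, as in the table's commit metadata.
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"rec\",\"fields\":["
            + "{\"name\":\"salary\",\"type\":\"double\"}]}");

    // Hypothetical reader schema: salary re-interpreted as a decimal backed by fixed.
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"rec\",\"fields\":["
            + "{\"name\":\"salary\",\"type\":{\"type\":\"fixed\",\"name\":\"salary_fixed\","
            + "\"size\":5,\"logicalType\":\"decimal\",\"precision\":10,\"scale\":2}}]}");

    GenericRecord rec = new GenericData.Record(writerSchema);
    rec.put("salary", 85000.0);

    // Serialize with the writer schema (double).
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(rec, encoder);
    encoder.flush();

    // Deserialize with writer != reader schema: the ResolvingDecoder cannot
    // promote double to fixed and throws something like:
    // org.apache.avro.AvroTypeException: Found double, expecting salary_fixed
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
  }
}
```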
Full stack trace:
Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert
for commit time 20241126001204349 at
org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:80)
at
org.apache.hudi.table.action.deltacommit.SparkUpsertDeltaCommitActionExecutor.execute(SparkUpsertDeltaCommitActionExecutor.java:47)
at
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:98)
at
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:88)
at
org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:162)
at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:221)
at
org.apache.hudi.HoodieSparkSqlWriterInternal.liftedTree1$1(HoodieSparkSqlWriter.scala:484)
at
org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:482)
at
org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:180)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:118) at
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.executeUpsert(MergeIntoHoodieTableCommand.scala:421)
at
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:264)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
at
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
at
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
at
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
at
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
at
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
at
org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:218) at
org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:98) at
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) at
org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:95) at
org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:640) at
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) at
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:630) at
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:671) at
org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651) at
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:226)
... 16 more Caused by: org.apache.spark.SparkException: Job aborted due to
stage failure: Task 1 in stage 4579.0 failed 4 times, most recent failure: Lost
task 1.3 in stage 4579.0 (TID 9325) (10.0.91.170 executor 1):
org.apache.hudi.exception.HoodieException: Exception when reading log file at
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:359)
at
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:218) at
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:203)
at
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:119)
at
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:77)
at
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:465)
at
org.apache.hudi.io.HoodieMergedReadHandle.getLogRecordScanner(HoodieMergedReadHandle.java:148)
at
org.apache.hudi.io.HoodieMergedReadHandle.getMergedRecords(HoodieMergedReadHandle.java:89)
at
org.apache.hudi.index.HoodieIndexUtils.lambda$getExistingRecords$c7e45d15$1(HoodieIndexUtils.java:246)
at
org.apache.hudi.data.HoodieJavaRDD.lambda$flatMap$a6598fcb$1(HoodieJavaRDD.java:137)
at
org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) at
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) at
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at
org.apache.spark.scheduler.Task.run(Task.scala:139) at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at
java.base/java.lang.Thread.run(Unknown Source) Caused by:
org.apache.avro.AvroTypeException: Found double, expecting
hoodie.hudi_global_index_mor.hudi_global_index_mor_record.salary.fixed at
org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308) at
org.apache.avro.io.parsing.Parser.advance(Parser.java:86) at
org.apache.avro.io.ValidatingDecoder.checkFixed(ValidatingDecoder.java:134) at
org.apache.avro.io.ValidatingDecoder.readFixed(ValidatingDecoder.java:144) at
org.apache.avro.generic.GenericDatumReader.readFixed(GenericDatumReader.java:387)
at
org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:190)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161) at
org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
at
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
at
org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180) at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161) at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154) at
org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:223)
at
org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:168)
at
org.apache.hudi.common.util.collection.MappingIterator.next(MappingIterator.java:44)
at
org.apache.hudi.common.util.collection.MappingIterator.next(MappingIterator.java:44)
at
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:613)
at
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:654)
at
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:349)
... 25 more Driver stacktrace: at
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2790)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2726)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2725)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2725) at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1211)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1211)
at scala.Option.foreach(Option.scala:407) at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1211)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2989) at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2928)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2917)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:976) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:2263) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:2284) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:2303) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:2328) at
org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1022) at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:408) at
org.apache.spark.rdd.RDD.collect(RDD.scala:1021) at
org.apache.spark.rdd.PairRDDFunctions.$anonfun$countByKey$1(PairRDDFunctions.scala:367) at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:408) at
org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:367) at
org.apache.spark.api.java.JavaPairRDD.countByKey(JavaPairRDD.scala:314) at
org.apache.hudi.data.HoodieJavaPairRDD.countByKey(HoodieJavaPairRDD.java:105)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.buildProfile(BaseSparkCommitActionExecutor.java:220)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:186)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:90)
at
org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:74)
... 60 more Caused by: org.apache.hudi.exception.HoodieException: Exception when reading log file at
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:359)
at
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:218)
at
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:203)
at
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:119)
at
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:77)
at
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:465)
at
org.apache.hudi.io.HoodieMergedReadHandle.getLogRecordScanner(HoodieMergedReadHandle.java:148)
at
org.apache.hudi.io.HoodieMergedReadHandle.getMergedRecords(HoodieMergedReadHandle.java:89)
at
org.apache.hudi.index.HoodieIndexUtils.lambda$getExistingRecords$c7e45d15$1(HoodieIndexUtils.java:246) at
org.apache.hudi.data.HoodieJavaRDD.lambda$flatMap$a6598fcb$1(HoodieJavaRDD.java:137)
at
org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) at
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) at
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at
org.apache.spark.scheduler.Task.run(Task.scala:139) at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) ... 3
more Caused by: org.apache.avro.AvroTypeException: Found double, expecting
hoodie.hudi_global_index_mor.hudi_global_index_mor_record.salary.fixed at
org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308) at
org.apache.avro.io.parsing.Parser.advance(Parser.java:86) at
org.apache.avro.io.ValidatingDecoder.checkFixed(ValidatingDecoder.java:134) at
org.apache.avro.io.ValidatingDecoder.readFixed(ValidatingDecoder.java:144) at
org.apache.avro.generic.GenericDatumReader.readFixed(GenericDatumReader.java:387)
at
org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:190)
at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161) at
org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
at
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180) at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161) at
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154) at
org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:223)
at
org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:168)
at
org.apache.hudi.common.util.collection.MappingIterator.next(MappingIterator.java:44)
at
org.apache.hudi.common.util.collection.MappingIterator.next(MappingIterator.java:44)
at
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:613)
at
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:654)
at
org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:349) ... 25 more at
org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:385)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254) at
com.zaxxer.hikari.pool.ProxyStatement.execute(ProxyStatement.java:94) at
com.zaxxer.hikari.pool.HikariProxyStatement.execute(HikariProxyStatement.java)
at
integrationTests.sql.HiveConnection.runQueryAndFetchResultsInCsv(HiveConnection.java:102)
... 154 more
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-8825
- Type: Sub-task
- Parent: https://issues.apache.org/jira/browse/HUDI-9109
- Fix version(s):
- 1.1.0
- Attachment(s):
- 05/Jan/25 23:29, daviszhang, itFailure.zip: https://issues.apache.org/jira/secure/attachment/13073784/itFailure.zip
---
## Comments
30/Apr/25 18:08, shivnarayan: I tried to reproduce the issue.
CREATE TABLE hudi_global_index_mor
( record_key STRING,
name STRING,
age INT,
department STRING,
salary DOUBLE,
ts BIGINT )
USING hudi
PARTITIONED BY (department)
LOCATION 'file:///tmp/hudi-mor-mit'
TBLPROPERTIES ( type = 'mor', 'hoodie.index.type' = 'GLOBAL_SIMPLE',
'hoodie.index.global.index.enable' = 'true', primaryKey = 'record_key',
preCombineField = 'ts' ) ;
INSERT INTO hudi_global_index_mor
SELECT * FROM ( SELECT 'emp_001' as record_key, 'John Doe' as name, 30 as
age, 'Sales' as department, 80000.0 as salary, 1598886000 as ts UNION ALL
SELECT 'emp_002', 'Jane Smith', 28, 'Sales', 75000.0, 1598886001 UNION ALL
SELECT 'emp_003', 'Bob Wilson', 35, 'Marketing', 85000.0, 1598886002 ) ;
UPDATE hudi_global_index_mor
SET department = 'Sales', salary = 90000.0, ts = 1598886100 WHERE record_key
= 'emp_001' ;
-> The UPDATE command fails because we can't assign the partition column, so that needs to be fixed.
{code:java}
UPDATE hudi_global_index_mor
> SET department = 'Sales', salary = 90000.0, ts =
1598886100 WHERE record_key = 'emp_001' ;
Detected disallowed assignment clause in UPDATE statement for partition
field `department` for table `spark_catalog.default.hudi_global_index_mor`.
Please remove the assignment clause to avoid the error.
spark-sql (default)> {code}
So, I had to retry without that column assignment:
UPDATE hudi_global_index_mor SET salary = 90000.0, ts = 1598886100 WHERE
record_key = 'emp_001' ;
{code:java}
select * from hudi_global_index_mor order by record_key ;
25/04/30 11:01:17 WARN PositionBasedFileGroupRecordBuffer: Falling back to
key based merge for Read
25/04/30 11:01:17 WARN PositionBasedFileGroupRecordBuffer: Falling back to
key based merge for Read
20250430110100330 20250430110100330_0_1 emp_001 department=Sales
2b49c5b2-1d99-40e2-b215-e5ef936ce9ed-0 emp_001 John Doe 30 90000.0
1598886100 Sales
20250430105611180 20250430105611180_0_1 emp_002 department=Sales
2b49c5b2-1d99-40e2-b215-e5ef936ce9ed-0_0-36-339_20250430105611180.parquet
emp_002 Jane Smith 28 75000.0 1598886001 Sales
20250430105611180 20250430105611180_1_0 emp_003 department=Marketing
06783ae0-b1dd-40a4-be55-ee98e8f72e03-0_1-36-340_20250430105611180.parquet
emp_003 Bob Wilson 35 85000.0 1598886002 Marketing
Time taken: 0.431 seconds, Fetched 3 row(s) {code}
CREATE OR REPLACE TEMPORARY VIEW source_updates AS SELECT * FROM ( SELECT
'emp_001' as record_key, 'John Doe' as name, 30 as age, 'Engineering' as
department, 95000.0 as salary, cast(1598886200 as BIGINT) as ts UNION ALL
SELECT 'emp_004', 'Alice Brown', 29, 'Engineering', 82000.0, cast(1598886201 as
BIGINT) ) ;
select * from source_updates;
emp_001 John Doe 30 Engineering 95000.0 1598886200
emp_004 Alice Brown 29 Engineering 82000.0 1598886201
Time taken: 0.119 seconds, Fetched 2 row(s)
MIT expects the record key as a mandatory assignment field; the MERGE in the Jira description above did not have one.
{code:java}
MERGE INTO hudi_global_index_mor t
> USING source_updates s
> ON t.record_key = s.record_key
> WHEN MATCHED THEN UPDATE SET department = s.department,
salary = s.salary, ts = s.ts
> WHEN NOT MATCHED THEN INSERT * ;
MERGE INTO field resolution error: No matching assignment found for target
table record key field `record_key`
spark-sql (default)> {code}
So, I retried with the record key assignment added:
MERGE INTO hudi_global_index_mor t
USING source_updates s
ON t.record_key = s.record_key
WHEN MATCHED THEN UPDATE SET department = s.department, salary = s.salary,
record_key = s.record_key, ts = s.ts
WHEN NOT MATCHED THEN INSERT * ;
Command succeeded.
{code:java}
MERGE INTO hudi_global_index_mor t
> USING source_updates s
> ON t.record_key = s.record_key
> WHEN MATCHED THEN UPDATE SET department = s.department,
salary = s.salary, record_key = s.record_key, ts = s.ts
> WHEN NOT MATCHED THEN INSERT * ;
25/04/30 11:03:32 WARN HoodieTableFileSystemView: Partition:
department=Engineering is not available in store
25/04/30 11:03:32 WARN HoodieLogBlock: There are records without valid
positions. Skip writing record positions to the block header.
25/04/30 11:03:33 WARN log: Updating partition stats fast for:
hudi_global_index_mor_ro
25/04/30 11:03:33 WARN log: Updated size to 435953
25/04/30 11:03:33 WARN log: Updating partition stats fast for:
hudi_global_index_mor_rt
25/04/30 11:03:33 WARN log: Updated size to 435953
25/04/30 11:03:34 WARN log: Updating partition stats fast for:
hudi_global_index_mor
25/04/30 11:03:34 WARN log: Updated size to 435953
25/04/30 11:03:34 WARN HiveConf: HiveConf of name
hive.internal.ss.authz.settings.applied.marker does not exist
25/04/30 11:03:34 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout
does not exist
25/04/30 11:03:34 WARN HiveConf: HiveConf of name hive.stats.retries.wait
does not exist
Time taken: 3.65 seconds
spark-sql (default)> {code}
Querying the table to confirm:
{code:java}
> select * from hudi_global_index_mor order by record_key ;
25/04/30 11:03:46 WARN PositionBasedFileGroupRecordBuffer: Falling back to
key based merge for Read
25/04/30 11:03:46 WARN PositionBasedFileGroupRecordBuffer: Falling back to
key based merge for Read
20250430110331085 20250430110331085_1_0 emp_001 department=Engineering
4ca675cc-7478-46a2-a555-d552638a380a-0_1-94-740_20250430110331085.parquet
emp_001 NULL NULL 95000.0 1598886200 Engineering
20250430105611180 20250430105611180_0_1 emp_002 department=Sales
2b49c5b2-1d99-40e2-b215-e5ef936ce9ed-0_0-36-339_20250430105611180.parquet
emp_002 Jane Smith 28 75000.0 1598886001 Sales
20250430105611180 20250430105611180_1_0 emp_003 department=Marketing
06783ae0-b1dd-40a4-be55-ee98e8f72e03-0_1-36-340_20250430105611180.parquet
emp_003 Bob Wilson 35 85000.0 1598886002 Marketing
20250430110331085 20250430110331085_1_1 emp_004 department=Engineering
4ca675cc-7478-46a2-a555-d552638a380a-0_1-94-740_20250430110331085.parquet
emp_004 Alice Brown 29 82000.0 1598886201 Engineering
Time taken: 0.317 seconds, Fetched 4 row(s) {code}
So, I could not reproduce the issue that the user is reporting. The only difference is that I am using the latest master.