pftn opened a new issue, #8892:
URL: https://github.com/apache/hudi/issues/8892
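**Describe the problem you faced**
Using Flink 1.16 with the bucket index on a MOR table, on the Hudi master branch at commit 6ef00d1
([HUDI-5816] List all partitions as the fallback mechanism in Hive and Glue Sync (#8388)),
the writer fails with a "Duplicate fileId" error during the BucketStreamWriteFunction index bootstrap.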
**Environment Description**
Hudi version : master at commit: 6ef00d1 [HUDI-5816] List all partitions as
the fallback mechanism in Hive and Glue Sync (#8388)
Flink version : 1.16.0
Hadoop version : 3.1.1.3.1.5.0-152
Storage (HDFS/S3/GCS..) : HDFS
Running on Docker? (yes/no) : no
**Additional context**
**Files under the partition when the error occurs:**
20220604/.00000000-052e-4a9c-b004-bea8a573603d_20230513052137245.log.1_12-20-0
20220604/.00000000-052e-4a9c-b004-bea8a573603d_20230530145652212.log.1_12-20-0
20220604/.00000004-53f8-4b03-b790-6390e9e6f6f3_20230513052137245.log.1_16-20-0
20220604/.00000014-0043-44ed-8426-a72c4a7d27b0_20230513052137245.log.1_6-20-39
20220604/.hoodie_partition_metadata
20220604/00000000-052e-4a9c-b004-bea8a573603d_5-20-0_20230530145652212.parquet
20220604/00000000-052e-4a9c-b004-bea8a573603d_5-20-0_20230530150242839.parquet
20220604/00000000-052e-4a9c-b004-bea8a573603d_5-20-6_20230513052137245.parquet
20220604/00000001-b562-41f4-80e4-43c829975d0b_11-20-6_20230513052137245.parquet
20220604/00000002-0826-4424-afb3-f60bd0711482_3-20-6_20230513052137245.parquet
20220604/00000003-67f4-4079-95ea-0fc16284a120_6-20-0_20230525000328419.parquet
20220604/00000004-53f8-4b03-b790-6390e9e6f6f3_1-20-0_20230525013419398.parquet
20220604/00000004-53f8-4b03-b790-6390e9e6f6f3_1-20-6_20230513052137245.parquet
20220604/00000006-e017-4e65-92e8-d028a81a31b3_0-20-0_20230510033838934.parquet
20220604/00000007-3477-401f-982e-e5ae38ca0e23_3-20-6_20230510170043301.parquet
20220604/00000007-4bc1-4340-a9d8-330666a58244_5-20-6_20230511183601566.parquet
20220604/00000008-3e72-42ab-b1cc-96b3264d7383_16-20-6_20230513052137245.parquet
20220604/00000009-5876-4e72-9cda-656772feb7a6_17-20-6_20230511183601566.parquet
20220604/00000009-c3bc-4ae4-a1e0-917970420ac7_1-20-6_20230510170043301.parquet
20220604/00000010-fd6f-4576-b440-ad3af56b0176_4-20-6_20230513052137245.parquet
20220604/00000012-7487-4944-84cd-d8315f077710_3-20-0_20230524050500776.parquet
20220604/00000013-de1e-42a5-a4be-8966d9b4180b_3-20-0_20230524223949535.parquet
20220604/00000014-0043-44ed-8426-a72c4a7d27b0_18-20-6_20230513052137245.parquet
20220604/00000015-04dd-44e3-8db6-c34c7f8c0e95_1-20-6_20230511183601566.parquet
20220604/00000016-23ce-44fd-9bf6-32731412accb_6-20-6_20230513052137245.parquet
20220604/00000018-1e68-446b-a68e-c1cc40c2aa1c_13-20-6_20230510170043301.parquet
20220604/00000018-9568-4796-8cf3-3717aa646dc2_5-20-6_20230511183601566.parquet
20220604/00000019-a295-4612-8884-e4b4503c0fe4_17-20-0_20230525001204325.parquet
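To make the conflict in the listing above easier to spot, here is a minimal Python sketch that groups file names by bucket and reports buckets holding more than one distinct file ID. It assumes, based only on the names in this listing, that a file ID is UUID-shaped and that its first 8 digits are the bucket number; the helper names (`file_ids_by_bucket`, `duplicate_buckets`) and the regular expression are illustrative and not part of Hudi.

```python
import re
from collections import defaultdict

# Assumed naming, inferred from the listing: an optional leading dot (log files),
# then a UUID-shaped file ID whose first 8 digits are the bucket number.
FILE_ID_RE = re.compile(
    r"^\.?(\d{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})_"
)


def file_ids_by_bucket(paths):
    """Map bucket number -> set of distinct file IDs found in `paths`."""
    buckets = defaultdict(set)
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        match = FILE_ID_RE.match(name)
        if match is None:
            continue  # not a data/log file, e.g. .hoodie_partition_metadata
        file_id = match.group(1)
        buckets[int(file_id[:8])].add(file_id)
    return dict(buckets)


def duplicate_buckets(paths):
    """Keep only buckets with more than one file ID; IDs are sorted."""
    return {
        bucket: sorted(ids)
        for bucket, ids in file_ids_by_bucket(paths).items()
        if len(ids) > 1
    }


# A subset of the listing, copied verbatim.
SAMPLE = [
    "20220604/.00000000-052e-4a9c-b004-bea8a573603d_20230513052137245.log.1_12-20-0",
    "20220604/.hoodie_partition_metadata",
    "20220604/00000000-052e-4a9c-b004-bea8a573603d_5-20-0_20230530145652212.parquet",
    "20220604/00000007-3477-401f-982e-e5ae38ca0e23_3-20-6_20230510170043301.parquet",
    "20220604/00000007-4bc1-4340-a9d8-330666a58244_5-20-6_20230511183601566.parquet",
    "20220604/00000009-5876-4e72-9cda-656772feb7a6_17-20-6_20230511183601566.parquet",
    "20220604/00000009-c3bc-4ae4-a1e0-917970420ac7_1-20-6_20230510170043301.parquet",
    "20220604/00000018-1e68-446b-a68e-c1cc40c2aa1c_13-20-6_20230510170043301.parquet",
    "20220604/00000018-9568-4796-8cf3-3717aa646dc2_5-20-6_20230511183601566.parquet",
]

if __name__ == "__main__":
    for bucket, ids in sorted(duplicate_buckets(SAMPLE).items()):
        print(bucket, ids)
```

On this subset the sketch flags buckets 7, 9 and 18, each with two file IDs, while bucket 0 is not flagged because all of its files share one file ID.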
**Stacktrace**
java.lang.RuntimeException: Duplicate fileId 00000009-c3bc-4ae4-a1e0-917970420ac7 from bucket 9 of partition 20220604 found during the BucketStreamWriteFunction index bootstrap.
    at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.lambda$bootstrapIndexIfNeed$1(BucketStreamWriteFunction.java:167)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.bootstrapIndexIfNeed(BucketStreamWriteFunction.java:160)
    at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.processElement(BucketStreamWriteFunction.java:112)
    at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
    at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:914)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
    at java.lang.Thread.run(Thread.java:748)
java.lang.RuntimeException: Duplicate fileId 00000007-3477-401f-982e-e5ae38ca0e23 from bucket 7 of partition 20220604 found during the BucketStreamWriteFunction index bootstrap.
    at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.lambda$bootstrapIndexIfNeed$1(BucketStreamWriteFunction.java:167)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.bootstrapIndexIfNeed(BucketStreamWriteFunction.java:160)
    at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.processElement(BucketStreamWriteFunction.java:112)
    at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
    at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:914)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
    at java.lang.Thread.run(Thread.java:748)
java.lang.RuntimeException: Duplicate fileId 00000018-9568-4796-8cf3-3717aa646dc2 from bucket 18 of partition 20220604 found during the BucketStreamWriteFunction index bootstrap.
    at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.lambda$bootstrapIndexIfNeed$1(BucketStreamWriteFunction.java:167)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.bootstrapIndexIfNeed(BucketStreamWriteFunction.java:160)
    at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.processElement(BucketStreamWriteFunction.java:112)
    at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
    at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:914)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
    at java.lang.Thread.run(Thread.java:748)
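For readers unfamiliar with the check behind these exceptions, the following is a simplified Python model of what the error message implies: while building a bucket-to-file-ID map for a partition, a second file ID for an already-mapped bucket is treated as fatal. This is a sketch under assumptions, not the Hudi implementation: the function name `bootstrap_bucket_index` is hypothetical, the bucket number is assumed to be the first 8 digits of the file ID, and any per-task filtering the real function may apply is left out.

```python
def bootstrap_bucket_index(file_ids, partition):
    """Build bucket -> file ID, failing on the second file ID seen for a bucket.

    `file_ids` is expected to contain each file group's ID once.
    """
    bucket_to_file_id = {}
    for file_id in file_ids:
        bucket = int(file_id[:8])  # assumption: first 8 digits encode the bucket
        if bucket in bucket_to_file_id:
            # Message format copied from the stack traces in this report.
            raise RuntimeError(
                f"Duplicate fileId {file_id} from bucket {bucket} of partition "
                f"{partition} found during the BucketStreamWriteFunction index bootstrap."
            )
        bucket_to_file_id[bucket] = file_id
    return bucket_to_file_id


if __name__ == "__main__":
    try:
        bootstrap_bucket_index(
            [
                "00000009-5876-4e72-9cda-656772feb7a6",
                "00000009-c3bc-4ae4-a1e0-917970420ac7",
            ],
            "20220604",
        )
    except RuntimeError as err:
        print(err)
```

In this model the file ID named in the error is simply whichever one is visited second, so the reported ID depends on iteration order rather than on which file group is the "wrong" one.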
**Files after running the `repair deduplicate` command:**
20220604/00000000-052e-4a9c-b004-bea8a573603d_5-20-0_20230530150242839.parquet
20220604/00000001-b562-41f4-80e4-43c829975d0b_11-20-6_20230513052137245.parquet
20220604/00000002-0826-4424-afb3-f60bd0711482_3-20-6_20230513052137245.parquet
20220604/00000003-67f4-4079-95ea-0fc16284a120_6-20-0_20230525000328419.parquet
20220604/00000004-53f8-4b03-b790-6390e9e6f6f3_1-20-0_20230525013419398.parquet
20220604/00000006-e017-4e65-92e8-d028a81a31b3_0-20-0_20230510033838934.parquet
20220604/00000007-4bc1-4340-a9d8-330666a58244_5-20-6_20230511183601566.parquet
20220604/00000008-3e72-42ab-b1cc-96b3264d7383_16-20-6_20230513052137245.parquet
20220604/00000009-5876-4e72-9cda-656772feb7a6_17-20-6_20230511183601566.parquet
20220604/00000010-fd6f-4576-b440-ad3af56b0176_4-20-6_20230513052137245.parquet
20220604/00000012-7487-4944-84cd-d8315f077710_3-20-0_20230524050500776.parquet
20220604/00000013-de1e-42a5-a4be-8966d9b4180b_3-20-0_20230524223949535.parquet
20220604/00000014-0043-44ed-8426-a72c4a7d27b0_18-20-6_20230513052137245.parquet
20220604/00000015-04dd-44e3-8db6-c34c7f8c0e95_1-20-6_20230511183601566.parquet
20220604/00000016-23ce-44fd-9bf6-32731412accb_6-20-6_20230513052137245.parquet
20220604/00000018-9568-4796-8cf3-3717aa646dc2_5-20-6_20230511183601566.parquet
20220604/00000019-a295-4612-8884-e4b4503c0fe4_17-20-0_20230525001204325.parquet