pftn opened a new issue, #8892:
URL: https://github.com/apache/hudi/issues/8892

   **Describe the problem you faced**
   
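   Using Flink 1.16 with the bucket index on a MOR table, on the Hudi master branch at commit 6ef00d1 ([HUDI-5816] List all partitions as the fallback mechanism in Hive and Glue Sync (#8388)), the writer fails with a `Duplicate fileId` error during the BucketStreamWriteFunction index bootstrap.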
   
   **Environment Description**
   
   Hudi version : master at commit 6ef00d1 ([HUDI-5816] List all partitions as the fallback mechanism in Hive and Glue Sync (#8388))
   Flink version : 1.16.0
   Hadoop version : 3.1.1.3.1.5.0-152
   Storage (HDFS/S3/GCS..) : HDFS
   Running on Docker? (yes/no) : no
   
   **Additional context**
   
   **Files under the partition when the error occurs:**

   20220604/.00000000-052e-4a9c-b004-bea8a573603d_20230513052137245.log.1_12-20-0
   20220604/.00000000-052e-4a9c-b004-bea8a573603d_20230530145652212.log.1_12-20-0
   20220604/.00000004-53f8-4b03-b790-6390e9e6f6f3_20230513052137245.log.1_16-20-0
   20220604/.00000014-0043-44ed-8426-a72c4a7d27b0_20230513052137245.log.1_6-20-39
   20220604/.hoodie_partition_metadata
   20220604/00000000-052e-4a9c-b004-bea8a573603d_5-20-0_20230530145652212.parquet
   20220604/00000000-052e-4a9c-b004-bea8a573603d_5-20-0_20230530150242839.parquet
   20220604/00000000-052e-4a9c-b004-bea8a573603d_5-20-6_20230513052137245.parquet
   20220604/00000001-b562-41f4-80e4-43c829975d0b_11-20-6_20230513052137245.parquet
   20220604/00000002-0826-4424-afb3-f60bd0711482_3-20-6_20230513052137245.parquet
   20220604/00000003-67f4-4079-95ea-0fc16284a120_6-20-0_20230525000328419.parquet
   20220604/00000004-53f8-4b03-b790-6390e9e6f6f3_1-20-0_20230525013419398.parquet
   20220604/00000004-53f8-4b03-b790-6390e9e6f6f3_1-20-6_20230513052137245.parquet
   20220604/00000006-e017-4e65-92e8-d028a81a31b3_0-20-0_20230510033838934.parquet
   20220604/00000007-3477-401f-982e-e5ae38ca0e23_3-20-6_20230510170043301.parquet
   20220604/00000007-4bc1-4340-a9d8-330666a58244_5-20-6_20230511183601566.parquet
   20220604/00000008-3e72-42ab-b1cc-96b3264d7383_16-20-6_20230513052137245.parquet
   20220604/00000009-5876-4e72-9cda-656772feb7a6_17-20-6_20230511183601566.parquet
   20220604/00000009-c3bc-4ae4-a1e0-917970420ac7_1-20-6_20230510170043301.parquet
   20220604/00000010-fd6f-4576-b440-ad3af56b0176_4-20-6_20230513052137245.parquet
   20220604/00000012-7487-4944-84cd-d8315f077710_3-20-0_20230524050500776.parquet
   20220604/00000013-de1e-42a5-a4be-8966d9b4180b_3-20-0_20230524223949535.parquet
   20220604/00000014-0043-44ed-8426-a72c4a7d27b0_18-20-6_20230513052137245.parquet
   20220604/00000015-04dd-44e3-8db6-c34c7f8c0e95_1-20-6_20230511183601566.parquet
   20220604/00000016-23ce-44fd-9bf6-32731412accb_6-20-6_20230513052137245.parquet
   20220604/00000018-1e68-446b-a68e-c1cc40c2aa1c_13-20-6_20230510170043301.parquet
   20220604/00000018-9568-4796-8cf3-3717aa646dc2_5-20-6_20230511183601566.parquet
   20220604/00000019-a295-4612-8884-e4b4503c0fe4_17-20-0_20230525001204325.parquet
   
   **Stacktrace**
   java.lang.RuntimeException: Duplicate fileId 00000009-c3bc-4ae4-a1e0-917970420ac7 from bucket 9 of partition 20220604 found during the BucketStreamWriteFunction index bootstrap.
        at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.lambda$bootstrapIndexIfNeed$1(BucketStreamWriteFunction.java:167)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.bootstrapIndexIfNeed(BucketStreamWriteFunction.java:160)
        at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.processElement(BucketStreamWriteFunction.java:112)
        at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
        at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233)
        at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
        at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
        at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780)
        at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
        at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:914)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
        at java.lang.Thread.run(Thread.java:748)
   
   java.lang.RuntimeException: Duplicate fileId 00000007-3477-401f-982e-e5ae38ca0e23 from bucket 7 of partition 20220604 found during the BucketStreamWriteFunction index bootstrap.
        at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.lambda$bootstrapIndexIfNeed$1(BucketStreamWriteFunction.java:167)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.bootstrapIndexIfNeed(BucketStreamWriteFunction.java:160)
        at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.processElement(BucketStreamWriteFunction.java:112)
        at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
        at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233)
        at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
        at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
        at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780)
        at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
        at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:914)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
        at java.lang.Thread.run(Thread.java:748)
   
   java.lang.RuntimeException: Duplicate fileId 00000018-9568-4796-8cf3-3717aa646dc2 from bucket 18 of partition 20220604 found during the BucketStreamWriteFunction index bootstrap.
        at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.lambda$bootstrapIndexIfNeed$1(BucketStreamWriteFunction.java:167)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.bootstrapIndexIfNeed(BucketStreamWriteFunction.java:160)
        at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.processElement(BucketStreamWriteFunction.java:112)
        at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
        at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233)
        at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
        at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
        at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542)
        at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780)
        at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
        at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:914)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
        at java.lang.Thread.run(Thread.java:748)
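
The three errors line up with the partition listing: buckets 7, 9 and 18 each contain base files with two different fileIds, while buckets such as 0 and 4 hold several files that all share one fileId. A minimal sketch for spotting such buckets from a file listing follows. It is not Hudi code. It assumes, based on the names in this listing and on the error messages, that a fileId is a UUID-shaped string whose first 8 digits are the zero-padded bucket number; the helper names `file_ids_by_bucket` and `duplicated_buckets` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a data or log file name starts with an optional "." followed by
# the fileId and "_", and the fileId's first 8 digits are the bucket number.
FILE_ID = re.compile(
    r"^\.?(\d{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})_"
)


def file_ids_by_bucket(file_names):
    """Map bucket number -> set of distinct fileIds seen for that bucket."""
    buckets = defaultdict(set)
    for name in file_names:
        base_name = name.rsplit("/", 1)[-1]  # drop the partition path
        match = FILE_ID.match(base_name)
        if match:  # skips entries such as .hoodie_partition_metadata
            file_id = match.group(1)
            buckets[int(file_id[:8])].add(file_id)
    return buckets


def duplicated_buckets(file_names):
    """Return only the buckets that hold more than one distinct fileId."""
    return {
        bucket: sorted(ids)
        for bucket, ids in file_ids_by_bucket(file_names).items()
        if len(ids) > 1
    }


if __name__ == "__main__":
    # A few entries copied from the listing in this issue.
    sample = [
        "20220604/.hoodie_partition_metadata",
        "20220604/.00000000-052e-4a9c-b004-bea8a573603d_20230513052137245.log.1_12-20-0",
        "20220604/00000000-052e-4a9c-b004-bea8a573603d_5-20-0_20230530145652212.parquet",
        "20220604/00000009-5876-4e72-9cda-656772feb7a6_17-20-6_20230511183601566.parquet",
        "20220604/00000009-c3bc-4ae4-a1e0-917970420ac7_1-20-6_20230510170043301.parquet",
    ]
    for bucket, ids in duplicated_buckets(sample).items():
        print(f"bucket {bucket}: {ids}")
```

Run against the full listing, this kind of check would be expected to flag exactly the buckets named in the stack traces, which can help confirm which files to inspect before repairing anything.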
   
   **Files after running the `repair deduplicate` command:**

   20220604/00000000-052e-4a9c-b004-bea8a573603d_5-20-0_20230530150242839.parquet
   20220604/00000001-b562-41f4-80e4-43c829975d0b_11-20-6_20230513052137245.parquet
   20220604/00000002-0826-4424-afb3-f60bd0711482_3-20-6_20230513052137245.parquet
   20220604/00000003-67f4-4079-95ea-0fc16284a120_6-20-0_20230525000328419.parquet
   20220604/00000004-53f8-4b03-b790-6390e9e6f6f3_1-20-0_20230525013419398.parquet
   20220604/00000006-e017-4e65-92e8-d028a81a31b3_0-20-0_20230510033838934.parquet
   20220604/00000007-4bc1-4340-a9d8-330666a58244_5-20-6_20230511183601566.parquet
   20220604/00000008-3e72-42ab-b1cc-96b3264d7383_16-20-6_20230513052137245.parquet
   20220604/00000009-5876-4e72-9cda-656772feb7a6_17-20-6_20230511183601566.parquet
   20220604/00000010-fd6f-4576-b440-ad3af56b0176_4-20-6_20230513052137245.parquet
   20220604/00000012-7487-4944-84cd-d8315f077710_3-20-0_20230524050500776.parquet
   20220604/00000013-de1e-42a5-a4be-8966d9b4180b_3-20-0_20230524223949535.parquet
   20220604/00000014-0043-44ed-8426-a72c4a7d27b0_18-20-6_20230513052137245.parquet
   20220604/00000015-04dd-44e3-8db6-c34c7f8c0e95_1-20-6_20230511183601566.parquet
   20220604/00000016-23ce-44fd-9bf6-32731412accb_6-20-6_20230513052137245.parquet
   20220604/00000018-9568-4796-8cf3-3717aa646dc2_5-20-6_20230511183601566.parquet
   20220604/00000019-a295-4612-8884-e4b4503c0fe4_17-20-0_20230525001204325.parquet


