Hi, I tried to ingest records into S3 in 2 runs (20K and 50K partitions) using bulk_insert mode and a COW table.
All of the stages look reasonably fine except the last one, where the writes are finalized (HoodieTable.finalizeWrite): it has to scan the whole directory structure in S3, and I can see a big pause there. The logs are below. I tried the same on Hadoop and it is very quick. Has anyone used Hudi on S3? What is the recommended number of partitions on S3? We have on average 20M records per table that we need to ingest into S3.

2020-06-09 05:49:18,641 [Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 144
2020-06-09 06:42:15,158 [dispatcher-event-loop-3] INFO org.apache.spark.scheduler.BlacklistTracker - Removing executors Set(2, 33, 24, 26, 4, 6, 16, 3, 25, 13) from blacklist because the blacklist for those executors has timed out
2020-06-09 06:59:37,471 [Driver] INFO org.apache.hudi.table.HoodieTable - Removing duplicate data files created due to spark retries before committing. Paths=[s3a:
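For context, here is a minimal sketch of the kind of bulk_insert write described above (Scala/Spark). The table name, record key, partition field, precombine field, and S3 path are illustrative placeholders, not the actual job's values:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiBulkInsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-bulk-insert").getOrCreate()
    import spark.implicits._

    // Hypothetical input; the real job ingests ~20M records per table.
    val df = Seq((1L, "2020-06-09", "a"), (2L, "2020-06-09", "b"))
      .toDF("record_key", "partition_path", "payload")

    df.write.format("hudi")
      .option("hoodie.table.name", "my_table")                            // placeholder table name
      .option("hoodie.datasource.write.operation", "bulk_insert")         // bulk_insert mode, as above
      .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")      // COW table, as above
      .option("hoodie.datasource.write.recordkey.field", "record_key")    // placeholder key field
      .option("hoodie.datasource.write.partitionpath.field", "partition_path") // placeholder partition field
      .option("hoodie.datasource.write.precombine.field", "record_key")   // placeholder precombine field
      .mode(SaveMode.Append)
      .save("s3a://my-bucket/hudi/my_table")                              // placeholder s3a:// path
  }
}
```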
