Hi,
I tried ingesting records into S3 in two runs (20K and 50K partitions) using 
bulk_insert mode and a COW table.
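
For context, this is roughly how the write is set up (a minimal sketch only; the 
table name, field names and S3 paths below are placeholders, not the actual ones 
from our job):

import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiBulkInsertExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-bulk-insert-cow")
      // Hudi requires Kryo serialization on the write path
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // placeholder source; the real records come from our ingestion pipeline
    val df = spark.read.parquet("s3a://my-bucket/input/")

    df.write
      .format("hudi")
      // bulk_insert on a COPY_ON_WRITE table, as described above
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
      .option("hoodie.table.name", "my_table")                                // placeholder
      .option("hoodie.datasource.write.recordkey.field", "record_id")         // placeholder
      .option("hoodie.datasource.write.precombine.field", "event_ts")         // placeholder
      // this partition path column is what produces the 20K/50K S3 partitions
      .option("hoodie.datasource.write.partitionpath.field", "partition_col") // placeholder
      .mode(SaveMode.Append)
      .save("s3a://my-bucket/hudi/my_table")                                  // placeholder base path
  }
}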

All the stages look reasonably fine except the last one, where the writes are 
finalized (HoodieTable.finalizeWrite). It needs to scan through the whole 
directory structure in S3, and I can see a big pause there (logs below).

I tried the same thing on Hadoop and it is very quick. Has anyone used Hudi on 
S3? What is the recommended number of partitions on S3? We have an average of 
20M records per table that we need to ingest into S3.


2020-06-09 05:49:18,641 [Spark Context Cleaner] INFO  org.apache.spark.ContextCleaner - Cleaned accumulator 144
2020-06-09 06:42:15,158 [dispatcher-event-loop-3] INFO  org.apache.spark.scheduler.BlacklistTracker - Removing executors Set(2, 33, 24, 26, 4, 6, 16, 3, 25, 13) from blacklist because the blacklist for those executors has timed out
2020-06-09 06:59:37,471 [Driver] INFO  org.apache.hudi.table.HoodieTable - Removing duplicate data files created due to spark retries before committing. Paths=[s3a:
