[ https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009180#comment-17009180 ]
Vinoth Chandar commented on HUDI-494:
-------------------------------------

[~garyli1019] Thanks for reporting this. If you look closely, the parallelism stays intact until the actual writing happens. Hudi writing uses one Spark partition per file updated, so if your partitioning is too fine-grained, you will write a very large number of files.

This field seems to be a latitude? It can take arbitrary values, right? Do you really want to partition on it?

option("hoodie.datasource.write.partitionpath.field", "location")

>> I set the bulkInsertParallelism too high
What was the value you used? Guessing from the code snippet, I still feel this is coming from the partitioning.

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -------------------------------------------------------------
>
>                 Key: HUDI-494
>                 URL: https://issues.apache.org/jira/browse/HUDI-494
>             Project: Apache Hudi (incubating)
>          Issue Type: Test
>            Reporter: Yanjia Gary Li
>            Assignee: Vinoth Chandar
>            Priority: Major
>         Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 2020-01-02 at 8.53.44 PM.png, image-2020-01-05-07-30-53-567.png
>
>
> I am using a manual build of master after commit
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65].
> EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into
> HDFS. This seems to be related to the input size: with 7.7 GB input it was
> 3.2 million tasks, and with 9 GB input it was 3.7 million, both with a
> parallelism of 10.
> I am seeing a huge number of 0-byte files being written into the
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
> All the stages before this seem normal. Any idea what happened here? My
> first guess would be something related to the bloom filter index. Maybe
> something triggers a repartitioning with the bloom filter index? But I am
> not really familiar with that part of the code.
> Thanks
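
One rough way to check the partitioning hypothesis from the comment above is to count how many distinct partition path values the chosen field produces, since each distinct value needs at least one file of its own. The sketch below is plain Python with no Spark or Hudi involved. The sample records, the `partition_counts` helper, and the 10-degree bucket are illustrative assumptions, not anything taken from the reporter's job.

```python
from collections import Counter


def partition_counts(records, key):
    """Count records per partition path value produced by `key`."""
    return Counter(key(r) for r in records)


# Hypothetical sample: 100,000 records whose "location" is a latitude
# spread evenly over [-90, 90).
records = [
    {"id": i, "location": -90.0 + (i * 180.0) / 100_000}
    for i in range(100_000)
]

# Partitioning on the raw value: nearly every record gets its own partition.
fine = partition_counts(records, key=lambda r: str(r["location"]))

# Partitioning on a coarse 10-degree bucket (an assumed alternative).
coarse = partition_counts(
    records, key=lambda r: f"lat_{int(r['location'] // 10) * 10}"
)

print("distinct partitions (raw latitude):  ", len(fine))
print("distinct partitions (10-degree bucket):", len(coarse))
print("largest raw partition holds", max(fine.values()), "record(s)")
```

If the raw field yields roughly one partition per record, as it does here, a task count in the millions with only a handful of records per task would be consistent with what the Spark UI shows in the report.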