[
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101055#comment-17101055
]
Yanjia Gary Li edited comment on HUDI-494 at 5/8/20, 1:38 AM:
--------------------------------------------------------------
-Ok, I see what happened here. Root cause is
[https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]-
So basically commit 1 wrote a very small file (let's say 200 records) to a new
partition, day=05. Then when commit 2 was trying to write, it looked back at
commit 1 to get an estimated size per record, but because commit 1 had so few
records, the estimate was inaccurate and far too large. Hudi then calculated
records-per-file from that oversized record size and got a very small
records-per-file value. This leads to many small files.
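To make the failure mode concrete, here is a minimal, self-contained sketch of the sizing math; the class, method names, and numbers below are illustrative assumptions, not the actual Hudi internals:
{code:java}
// Illustrative sketch only; names and numbers are hypothetical.
public class RecordSizeEstimateSketch {

    // The estimate comes from the previous commit's metadata:
    // total bytes written divided by total records written.
    static long avgRecordSize(long totalBytesWritten, long totalRecordsWritten) {
        return totalBytesWritten / totalRecordsWritten;
    }

    public static void main(String[] args) {
        // Commit 1: a tiny file of 200 records. Most of its ~1 MB size is
        // fixed Parquet overhead (footer, metadata, bloom filter), so the
        // per-record estimate comes out wildly inflated.
        long estimate = avgRecordSize(1_000_000L, 200L); // 5000 bytes/record

        // Commit 2 then sizes new files as maxFileSize / estimate.
        long maxFileSize = 120L * 1024 * 1024;        // ~120 MB target size
        long recordsPerFile = maxFileSize / estimate; // ~25k records/file

        // With a realistic ~50 bytes/record, a 120 MB file would hold ~2.5M
        // records, so commit 2 creates ~100x more (small) files than needed.
        System.out.println("records per file = " + recordsPerFile);
    }
}
{code}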
was (Author: garyli1019):
Ok, I see what happened here. Root cause is
[https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]
So basically commit 1 wrote a very small file (let's say 200 records) to a new
partition, day=05. Then when commit 2 was trying to write to day=05, it would
look up the affected partition and use the Bloom index range from the existing
files, so it would use 200 here. Commit 2 has far more records than 200, so it
created tons of files since the Bloom index range is too small.
I am not really familiar with the indexing part of the code. Please let me know
if I understand this correctly so we can figure out a fix. [~lamber-ken]
[~vinoth]
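If the inflated record-size estimate is indeed the culprit, a possible mitigation (untested; assuming the standard Hudi sizing configs apply here, with example values and field names) is to pin the estimate explicitly so a tiny previous commit cannot skew the sizing math:
{code:java}
// Untested workaround sketch. The config key is a standard Hudi option;
// the value and the table/field names are examples only.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class WriteWithPinnedRecordSize {
    public static void write(Dataset<Row> df, String basePath) {
        df.write().format("hudi")
          .option("hoodie.table.name", "my_table")
          .option("hoodie.datasource.write.recordkey.field", "id")
          .option("hoodie.datasource.write.partitionpath.field", "day")
          // Pin bytes/record instead of deriving it from the last commit.
          .option("hoodie.copyonwrite.record.size.estimate", "64")
          .mode(SaveMode.Append)
          .save(basePath);
    }
}
{code}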
> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -------------------------------------------------------------
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
> Issue Type: Test
> Reporter: Yanjia Gary Li
> Assignee: Yanjia Gary Li
> Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png,
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after commit
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65].
> EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into
> HDFS. It seems related to the input size: with 7.7 GB of input there were 3.2
> million tasks, and with 9 GB there were 3.7 million. Both ran with a
> parallelism of 10.
> I am seeing a huge number of 0-byte files being written into the
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
> All the stages before this seem normal. Any idea what happened here? My
> first guess would be something related to the bloom filter index. Maybe
> something triggers repartitioning with the bloom filter index? But I am
> not really familiar with that part of the code.
> Thanks
>