[
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100967#comment-17100967
]
Yanjia Gary Li commented on HUDI-494:
-------------------------------------
Hi folks, this issue seems to be coming back again...
!example2_hdfs.png!
!example2_sparkui.png!
A very small (2 GB) upsert job creates 60,000+ files in a single partition and
gets stuck for 10+ hours. I believe there might be a bug in the BloomIndexing
stage.
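For context, the task counts in the original description quoted below appear to grow almost linearly with input size. A quick check of those reported figures, as a rough sketch only (the variable and function names are mine, purely for illustration, and the numbers are the rounded ones from the description):

```python
# Figures reported in the quoted issue description: (input size in GB, Spark tasks).
reported = [(7.7, 3_200_000), (9.0, 3_700_000)]


def tasks_per_gb(size_gb, tasks):
    """Return the average number of tasks per GB of input."""
    return tasks / size_gb


ratios = [tasks_per_gb(size, tasks) for size, tasks in reported]
for (size, tasks), ratio in zip(reported, ratios):
    print(f"{size} GB -> {tasks} tasks, about {ratio:,.0f} tasks per GB")

# Relative difference between the two ratios; a small value suggests
# the task count scales roughly linearly with input size.
spread = abs(ratios[0] - ratios[1]) / max(ratios)
print(f"relative spread: {spread:.1%}")
```

Both runs come out to a little over 410,000 tasks per GB, which is consistent with the task count being driven by input size rather than by the configured parallelism.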
> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -------------------------------------------------------------
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
> Issue Type: Test
> Reporter: Yanjia Gary Li
> Assignee: Vinoth Chandar
> Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png,
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manual build of master after commit
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65].
> EDIT: tried with the latest master but got the same result.
> I am seeing about 3 million tasks when the Hudi Spark job writes the files
> into HDFS. The count seems to be related to the input size: with 7.7 GB of
> input it was 3.2 million tasks, and with 9 GB it was 3.7 million. Both runs
> used a parallelism of 10.
> I am also seeing a huge number of 0-byte files being written into the
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task writes fewer
> than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
> All the stages before this seem normal. Any idea what happened here? My
> first guess would be something related to the bloom filter index. Maybe
> something in the bloom filter index triggers a repartition? But I am not
> really familiar with that part of the code.
> Thanks
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)