ZiyueGuan created HUDI-1875:
-------------------------------

             Summary: Improve perf of MOR table upsert based on HDFS
                 Key: HUDI-1875
                 URL: https://issues.apache.org/jira/browse/HUDI-1875
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: ZiyueGuan


Problem: When we upsert into a MOR table, Hudi assigns one task to each fileId 
that needs to be created or updated. In this situation, nearly one million 
tasks may be created, most of which simply append a few records to a fileId. 
The process can be slow, and a few skewed tasks appear.

Reason: Hudi uses hsync to guarantee that data is stored durably. Calling hsync 
this many times against an HDFS cluster within 2 minutes or less leads to high 
IOPS on the disks. In addition, creating so many tasks incurs a high scheduling 
overhead compared with the work of appending two or three records to a file.

TODO: 

Option one: use hflush instead of hsync. This may lead to data loss if all 
DataNodes (DNs) shut down at the same time. However, this has quite a low 
chance of occurring when HDFS is deployed across AZs.
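
A minimal sketch of what option one could look like, assuming the durability 
call is made switchable. Hadoop's Syncable interface (implemented by 
FSDataOutputStream) exposes both hflush() and hsync(); the SyncableStream 
interface below is only a local stand-in for that pair so the example runs 
without Hadoop on the classpath. DurabilityMode and makeDurable are 
hypothetical names, not existing Hudi settings or methods.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DurabilitySketch {
    /** Local stand-in for the hflush()/hsync() pair of Hadoop's Syncable. */
    public interface SyncableStream {
        void hflush() throws IOException; // data reaches the DataNodes, not necessarily their disks
        void hsync() throws IOException;  // data is additionally forced to disk on the DataNodes
    }

    /** Hypothetical switch; not an existing Hudi config. */
    public enum DurabilityMode { HFLUSH, HSYNC }

    /** Issues the cheaper or the stricter call depending on the chosen mode. */
    public static void makeDurable(SyncableStream out, DurabilityMode mode) throws IOException {
        if (mode == DurabilityMode.HSYNC) {
            out.hsync();
        } else {
            out.hflush();
        }
    }

    /** Test double that records which call was made. */
    public static class RecordingStream implements SyncableStream {
        public final List<String> calls = new ArrayList<>();
        @Override public void hflush() { calls.add("hflush"); }
        @Override public void hsync() { calls.add("hsync"); }
    }

    public static void main(String[] args) throws IOException {
        RecordingStream stream = new RecordingStream();
        makeDurable(stream, DurabilityMode.HFLUSH);
        makeDurable(stream, DurabilityMode.HSYNC);
        System.out.println(stream.calls); // prints [hflush, hsync]
    }
}
```

Keeping hsync available behind a switch would let clusters that are not 
deployed across AZs keep the stricter guarantee.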

Option two: make the hsync process asynchronous and let more than one writing 
process run in the same task. This reduces the number of tasks but increases 
memory use.
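
A rough sketch of the idea behind option two, using only java.util.concurrent. 
One task appends to several fileIds in turn and hands each blocking sync to a 
small thread pool, then waits for all of them before finishing. 
FileGroupAppender, CountingAppender and appendAll are hypothetical names for 
illustration and do not correspond to existing Hudi classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class BatchedAppendSketch {
    /** Hypothetical writer for a single fileId. */
    public interface FileGroupAppender {
        void append(List<String> records); // buffers records in memory
        void sync();                       // blocking durability call (e.g. hsync)
    }

    /** Appends to every fileId inside one task; syncs run on a bounded pool. */
    public static int appendAll(Map<String, List<String>> recordsByFileId,
                                Function<String, FileGroupAppender> open,
                                int syncThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(syncThreads);
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        try {
            for (Map.Entry<String, List<String>> e : recordsByFileId.entrySet()) {
                FileGroupAppender appender = open.apply(e.getKey());
                appender.append(e.getValue());
                // The task moves on to the next fileId while this sync is in flight.
                pending.add(CompletableFuture.runAsync(appender::sync, pool));
            }
            // The task only completes once every sync has finished.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
        return pending.size();
    }

    /** Test double that counts sync calls. */
    public static class CountingAppender implements FileGroupAppender {
        private final AtomicInteger syncs;
        public CountingAppender(AtomicInteger syncs) { this.syncs = syncs; }
        @Override public void append(List<String> records) { /* buffer only */ }
        @Override public void sync() { syncs.incrementAndGet(); }
    }

    public static void main(String[] args) {
        Map<String, List<String>> batch = new LinkedHashMap<>();
        batch.put("file-1", Arrays.asList("r1", "r2"));
        batch.put("file-2", Arrays.asList("r3"));
        AtomicInteger syncs = new AtomicInteger();
        int handled = appendAll(batch, id -> new CountingAppender(syncs), 2);
        System.out.println(handled + " fileIds, " + syncs.get() + " syncs");
    }
}
```

Every appender whose sync is still in flight keeps its buffers alive, which is 
where the extra memory use comes from; the pool size bounds how many syncs run 
concurrently.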

I may try option one first, as it is simple enough.

When



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
