Hi,
Recently a user reported a bug that data stored in HDFS could not be found after restarting the server [1]. While fixing it, I found some information that may be useful, and I'd like to share it with you.
First of all, the HDFS truncate function was only introduced in Hadoop 2.7.0 [2], so users who have installed an earlier Hadoop version will get a "truncate is not supported" error on the server. (Actually, I did not find any hint about the required Hadoop version on our official website...)
Secondly, in HDFS, appending and creating are entirely different operations. With the Java File API, we use `new FileOutputStream(file, append)` to either create (`append` = false) or append to (`append` = true) a FileOutputStream. In HDFS, however, `fs.append(path)` is used to open an FSDataOutputStream for appending when the path already exists, while `fs.create(path, overwrite)` is used to create a new file: if a file with this name already exists, it is overwritten when `overwrite` is true, and an exception is thrown when `overwrite` is false.
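To make the contrast concrete, here is a small self-contained sketch of the Java File API side (local files only, no Hadoop dependency; the HDFS equivalents are noted in comments, and the class/method names are my own):

```java
import java.io.*;
import java.nio.file.*;

public class AppendDemo {
    // Write "first", then write "second" with the given append flag,
    // mirroring the two HDFS operations:
    //   append = false  ~  fs.create(path, overwrite)  (old data replaced)
    //   append = true   ~  fs.append(path)             (old data kept)
    static String writeAndRead(boolean append) throws IOException {
        File f = File.createTempFile("demo", ".txt");
        try (FileOutputStream out = new FileOutputStream(f, false)) {
            out.write("first".getBytes());
        }
        try (FileOutputStream out = new FileOutputStream(f, append)) {
            out.write("second".getBytes());
        }
        String content = new String(Files.readAllBytes(f.toPath()));
        f.delete();
        return content;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeAndRead(true));  // old data kept, new data appended
        System.out.println(writeAndRead(false)); // old data replaced
    }
}
```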
Thirdly, HDFS uses a "lease" mechanism to maintain data consistency: while one client program is writing data to a file, no other client program is allowed to write to that file at the same time. This affects appending and truncating a lot, because every time I tried to call the truncate method on a file with an open output stream, an AlreadyBeingCreatedException was thrown. So I changed the code to close the stream before truncating and reopen it afterwards:
```
if (fs.exists(path)) {
  fsDataOutputStream.close(); // close the open stream first to release the lease
  fs.truncate(path, position); // truncate no longer hits AlreadyBeingCreatedException
  fsDataOutputStream = fs.append(path); // reopen the stream for appending
}
```
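The close-truncate-reopen sequence can be exercised without a Hadoop cluster by using the analogous local-file operations; this sketch (names are my own, `FileChannel.truncate` standing in for `fs.truncate`) shows that data before `position` survives and appends land after the truncation point:

```java
import java.io.*;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class TruncateDemo {
    // Mirror the HDFS sequence: close stream -> truncate -> reopen for append.
    static String truncateThenAppend(long position, String extra) throws IOException {
        File f = File.createTempFile("tsfile", ".bin");
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write("0123456789".getBytes()); // original data
        } // stream closed here; in HDFS this releases the lease
        try (FileChannel ch = FileChannel.open(f.toPath(), StandardOpenOption.WRITE)) {
            ch.truncate(position); // analogous to fs.truncate(path, position)
        }
        try (FileOutputStream out = new FileOutputStream(f, true)) { // analogous to fs.append(path)
            out.write(extra.getBytes());
        }
        String content = new String(Files.readAllBytes(f.toPath()));
        f.delete();
        return content;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(truncateThenAppend(5, "X")); // keeps bytes 0..4, then appends
    }
}
```

One HDFS-specific caveat that the local analogy cannot show: `fs.truncate` returns a boolean, and a `false` return means block recovery is still in progress, so the file is not immediately ready for `fs.append`.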
Do you have any better suggestions or ideas? You are welcome to discuss them with me.
[1] https://issues.apache.org/jira/browse/IOTDB-231
[2] https://issues.apache.org/jira/browse/HDFS-3107
BR,
--
Zesong Sun
School of Software, Tsinghua University
孙泽嵩
清华大学 软件学院