Hi,
Recently a user reported a bug that data stored in HDFS could not be found after restarting the server [1]. While fixing it, I found some information that may be useful, and I'd like to share it with you.
First of all, the HDFS truncate function was only introduced in Hadoop 2.7.0 [2], so users who have installed an earlier Hadoop version will get a "truncate is not supported" error on the server. (Actually, I did not find any hint about the required Hadoop version on our official website...)
Secondly, in HDFS, appending and creating are entirely different operations. With the Java File API, we use `new FileOutputStream(file, append)` to either create (`append` = false) or append to (`append` = true) a FileOutputStream. In HDFS, however, `fs.append(path)` is used to open an FSDataOutputStream for appending when the path already exists, while `fs.create(path, overwrite)` is used to create a new file: if a file with this name already exists, it is overwritten when `overwrite` is true, and an exception is thrown when `overwrite` is false.
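To make the contrast concrete, here is a small self-contained sketch of the Java File API side (local files only, no Hadoop dependency; the HDFS equivalents are noted in comments, and the class/method names are my own):

```java
import java.io.*;
import java.nio.file.*;

public class AppendDemo {
    // Write "first", then write "second" with the given append flag,
    // mirroring the two HDFS operations:
    //   append = false  ~  fs.create(path, overwrite)  (old data replaced)
    //   append = true   ~  fs.append(path)             (old data kept)
    static String writeAndRead(boolean append) throws IOException {
        File f = File.createTempFile("demo", ".txt");
        try (FileOutputStream out = new FileOutputStream(f, false)) {
            out.write("first".getBytes());
        }
        try (FileOutputStream out = new FileOutputStream(f, append)) {
            out.write("second".getBytes());
        }
        String content = new String(Files.readAllBytes(f.toPath()));
        f.delete();
        return content;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeAndRead(true));  // old data kept, new data appended
        System.out.println(writeAndRead(false)); // old data replaced
    }
}
```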
Thirdly, HDFS uses a "lease" mechanism to maintain data consistency: while one client program is writing data to a file, no other client program is allowed to write to that file at the same time. This affects appending and truncating a lot, because every time I tried to call the truncate method on a file with an open output stream, an AlreadyBeingCreatedException was thrown. So I changed the code to close the stream before truncating and reopen it afterwards:
```
if (fs.exists(path)) {
  fsDataOutputStream.close(); // close the open stream first to release the lease
  fs.truncate(path, position); // truncate no longer hits AlreadyBeingCreatedException
  fsDataOutputStream = fs.append(path); // reopen the stream for appending
}
```
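The close-truncate-reopen sequence can be exercised without a Hadoop cluster by using the analogous local-file operations; this sketch (names are my own, `FileChannel.truncate` standing in for `fs.truncate`) shows that data before `position` survives and appends land after the truncation point:

```java
import java.io.*;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class TruncateDemo {
    // Mirror the HDFS sequence: close stream -> truncate -> reopen for append.
    static String truncateThenAppend(long position, String extra) throws IOException {
        File f = File.createTempFile("tsfile", ".bin");
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write("0123456789".getBytes()); // original data
        } // stream closed here; in HDFS this releases the lease
        try (FileChannel ch = FileChannel.open(f.toPath(), StandardOpenOption.WRITE)) {
            ch.truncate(position); // analogous to fs.truncate(path, position)
        }
        try (FileOutputStream out = new FileOutputStream(f, true)) { // analogous to fs.append(path)
            out.write(extra.getBytes());
        }
        String content = new String(Files.readAllBytes(f.toPath()));
        f.delete();
        return content;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(truncateThenAppend(5, "X")); // keeps bytes 0..4, then appends
    }
}
```

One HDFS-specific caveat that the local analogy cannot show: `fs.truncate` returns a boolean, and a `false` return means block recovery is still in progress, so the file is not immediately ready for `fs.append`.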
Do you have any better suggestions or ideas? You are welcome to discuss them with me.
[1] https://issues.apache.org/jira/browse/IOTDB-231
[2] https://issues.apache.org/jira/browse/HDFS-3107
BR,
--
Zesong Sun
School of Software, Tsinghua University
孙泽嵩
清华大学 软件学院