[
https://issues.apache.org/jira/browse/HDFS-9909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225381#comment-15225381
]
Xiao Chen commented on HDFS-9909:
---------------------------------
Thanks for reporting this, Bogdan. I'd like to work on it.
The attached test program actually passes on trunk; I tested it further on
branch-2.7, where it fails. The reason is that trunk has HDFS-5356, which
automatically closes the DFS when the MiniDFSCluster is shut down.
To double-check this: if I change the {{FileSystem}} to a
{{DistributedFileSystem}} and call {{close}} before the shutdown, the test
passes on branch-2.7 as well.
{code:java}
/* start cluster and write something */
// FileSystem fs = cluster.getFileSystem();
DistributedFileSystem fs = cluster.getFileSystem(); // <=========== change to DFS
FSDataOutputStream out = fs.create(path, true);
out.write(bytes);
out.hflush();
/* stop cluster while file is open for writing */
fs.close(); // <=========== moved up: close before shutting down
cluster.shutdown();
....
{code}
So now comes the original question: what if the client never calls
{{DFS#close}} and HDFS restarts? Currently, the file's lease is only revoked
after the hard limit (1 hour) expires. Until then, reads fail (e.g.
{{cp: Cannot obtain block length for ...}}). This is because {{fs.delete}} and
{{fs.create}} will remove the lease, but {{fs.open}} will not.
Since it's the read that fails, I don't think it's a good idea to recover the
lease as part of the read operation. One possible solution is to detect such
files and revoke their leases on restart. I need to investigate further and
will update here.
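In the meantime, a client that knows a file is stuck in this state can force
lease recovery itself via {{DistributedFileSystem#recoverLease}}. A minimal
sketch (the helper class and the 1-second retry interval are made up for
illustration):
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Hypothetical helper, not part of HDFS: ask the NameNode to recover the
// lease of a file left open by a dead writer, and poll until the file is
// closed and safe to read.
class LeaseRecoveryUtil {
  static void forceRecoverLease(DistributedFileSystem dfs, Path path)
      throws IOException, InterruptedException {
    // recoverLease returns true once the file is closed; false means
    // recovery was triggered but has not completed yet.
    while (!dfs.recoverLease(path)) {
      Thread.sleep(1000); // arbitrary retry interval
    }
  }
}
{code}
This starts recovery well before the 1-hour hard limit, but as the report
below points out, the hard part is knowing when to call it.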
> Can't read file after hdfs restart
> ----------------------------------
>
> Key: HDFS-9909
> URL: https://issues.apache.org/jira/browse/HDFS-9909
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client, namenode
> Affects Versions: 2.7.1, 2.7.2
> Reporter: Bogdan Raducanu
> Priority: Critical
> Attachments: Main.java
>
>
> If HDFS is restarted while a file is open for writing, then new clients can't
> read that file until the hard lease limit expires and block recovery starts.
> Scenario:
> 1. write to a file, call hflush
> 2. without closing the file, restart HDFS
> 3. after HDFS is back up, opening the file for reading from a new client
> fails for 1 hour
> Repro attached.
> Thoughts:
> * possibly this also happens in other cases, not just when HDFS is restarted
> (e.g. when only all the datanodes in the pipeline are restarted)
> * As far as I can tell, this happens because the last block is in RWR state
> and getReplicaVisibleLength returns -1 for it. Recovery starts only after the
> hard lease limit expires (so the file becomes readable only after 1 hour).
> * one can call recoverLease, which will start lease recovery sooner, BUT how
> can one know when to call it? The exception thrown is a plain IOException,
> which can happen for other reasons.
> I think a reasonable solution would be to return a specialized exception
> (similar to AlreadyBeingCreatedException when trying to write to an open
> file).
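Regarding that last point: until a specialized exception exists, the only way
for a client to detect this case is to match on the exception message, which
is exactly why a dedicated type would help. A sketch of that fragile detection
(the helper class is hypothetical, and the message match is a heuristic, not a
documented contract):
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Hypothetical helper, not part of HDFS: open a file and, if the open fails
// because the last block's length cannot be determined, kick off lease
// recovery so a later retry can succeed sooner than the 1-hour hard limit.
class OpenWithRecovery {
  static FSDataInputStream open(DistributedFileSystem dfs, Path path)
      throws IOException {
    try {
      return dfs.open(path);
    } catch (IOException e) {
      // Fragile: string matching on the message is the only signal available
      // today; a specialized exception type would make this check reliable.
      if (String.valueOf(e.getMessage())
          .startsWith("Cannot obtain block length")) {
        dfs.recoverLease(path);
      }
      throw e;
    }
  }
}
{code}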
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)