[jira] [Comment Edited] (HDFS-10323) transient deleteOnExit failure in ViewFileSystem due to close() ordering

Wenxin He (JIRA) Wed, 01 Nov 2017 04:52:54 -0700

    [ https://issues.apache.org/jira/browse/HDFS-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16233962#comment-16233962 ]


Wenxin He edited comment on HDFS-10323 at 11/1/17 11:51 AM:
------------------------------------------------------------

I hit this problem too when using Spark, and the undeleted files left our HDFS 
cluster with no space.

So, following [~bpodgursky]'s suggestion and [~cmccabe]'s comment
bq. 2) FileSystem.Cache.closeAll could first close all ViewFileSystems, then 
all other FileSystems.

I have submitted the 001 patch to fix the problem.
In this patch:
# FileSystem.Cache.map is changed to a {color:red}LinkedHashMap{color}, in 
which file systems are stored in {color:red}insertion order{color}.
When a ViewFileSystem is initialized, its DistributedFileSystems are stored in 
FileSystem.Cache.map first, and the ViewFileSystem is stored after them.
# When FileSystem.Cache.closeAll is invoked, all cached file systems are 
{color:red}closed in reverse order{color}, like a LIFO model. So the 
ViewFileSystem is closed before the DistributedFileSystems it refers to, and 
all deleteOnExit files are deleted safely before those DistributedFileSystems 
are closed.
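
The two points above can be illustrated with a small, Hadoop-free sketch. It is not the code from the 001 patch: FakeFs, closeAll and demoCloseOrder are hypothetical names, and the cache keys are made-up URIs. It only shows that walking an insertion-ordered LinkedHashMap in reverse closes the entry that was cached last (the view) before the entries cached earlier (its children).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of an insertion-ordered cache that is closed in reverse (LIFO) order. */
public class LifoCloseSketch {

    /** Hypothetical stand-in for a cached FileSystem; it only records when it is closed. */
    static final class FakeFs {
        final String name;
        final List<String> closeLog;

        FakeFs(String name, List<String> closeLog) {
            this.name = name;
            this.closeLog = closeLog;
        }

        void close() {
            closeLog.add(name);
        }
    }

    /** Closes every cached entry, most recently inserted first, then clears the cache. */
    static void closeAll(Map<String, FakeFs> cache) {
        // Copy the values first so the cache is not modified while it is being walked.
        List<FakeFs> entries = new ArrayList<>(cache.values());
        Collections.reverse(entries);
        for (FakeFs fs : entries) {
            fs.close();
        }
        cache.clear();
    }

    /** Caches two children and then the view, and returns the order in which they close. */
    static List<String> demoCloseOrder() {
        List<String> log = new ArrayList<>();
        Map<String, FakeFs> cache = new LinkedHashMap<>(); // keeps insertion order
        cache.put("hdfs://ns1", new FakeFs("hdfs://ns1", log));
        cache.put("hdfs://ns2", new FakeFs("hdfs://ns2", log));
        cache.put("viewfs://cluster", new FakeFs("viewfs://cluster", log));
        closeAll(cache);
        return log;
    }

    public static void main(String[] args) {
        // prints [viewfs://cluster, hdfs://ns2, hdfs://ns1]
        System.out.println(demoCloseOrder());
    }
}
```

The ordering guarantee in this sketch depends entirely on the children having been inserted before the view; if a child were cached after the view, reverse iteration would close that child first.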


> transient deleteOnExit failure in ViewFileSystem due to close() ordering
> ------------------------------------------------------------------------
>
>                 Key: HDFS-10323
>                 URL: https://issues.apache.org/jira/browse/HDFS-10323
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: federation
>    Affects Versions: 2.6.0, 2.7.4, 3.0.0-beta1
>            Reporter: Ben Podgursky
>            Assignee: Wenxin He
>            Priority: Major
>         Attachments: HDFS-10323.001.patch
>
>
> After switching to using a ViewFileSystem, fs.deleteOnExit calls began 
> failing frequently, displaying this error on failure:
> 16/04/21 13:56:24 INFO fs.FileSystem: Ignoring failure to deleteOnExit for 
> path /tmp/delete_on_exit_test_123/a438afc0-a3ca-44f1-9eb5-010ca4a62d84
> Since FileSystem eats the error involved, it is difficult to be sure what the 
> error is, but I believe what is happening is that the ViewFileSystem’s child 
> FileSystems are being close()’d before the ViewFileSystem, due to the random 
> order ClientFinalizer closes FileSystems; so then when the ViewFileSystem 
> tries to close(), it tries to forward the delete() calls to the appropriate 
> child, and fails because the child is already closed.
> I’m unsure how to write an actual Hadoop test to reproduce this, since it 
> involves testing behavior on actual JVM shutdown.  However, I can verify that 
> while
> {code:java}
> fs.deleteOnExit(randomTemporaryDir); 
> {code}
> regularly (~50% of the time) fails to delete the temporary directory, this 
> code:
> {code:java}
> ViewFileSystem viewfs = (ViewFileSystem)fs1;
> for (FileSystem fileSystem : viewfs.getChildFileSystems()) {
>   if (fileSystem.exists(randomTemporaryDir)) {
>     fileSystem.deleteOnExit(randomTemporaryDir);
>   }
> }
> {code}
> always successfully deletes the temporary directory on JVM shutdown.
> I am not very familiar with FileSystem inheritance hierarchies, but at first 
> glance I see two ways to fix this behavior:
> 1)  ViewFileSystem could forward deleteOnExit calls to the appropriate child 
> FileSystem, and not hold onto that path itself.
> 2) FileSystem.Cache.closeAll could first close all ViewFileSystems, then all 
> other FileSystems.  
> I would appreciate any thoughts on whether this seems accurate, and thoughts 
> (or help) on the fix.
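
Option 1 from the description can be sketched without Hadoop as follows. This is only an illustration under stated assumptions: ChildFs and ViewFs are hypothetical toy classes, the mount table is a plain prefix map, and the view path is passed to the child unchanged, whereas a real mount table would translate it to a target path. The point is that once the view forwards deleteOnExit to the child and keeps no pending paths of its own, the order in which the two are closed no longer matters.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of a view that forwards deleteOnExit to the child owning the path. */
public class ForwardDeleteOnExitSketch {

    /** Hypothetical child file system: a set of paths plus its own deleteOnExit list. */
    static class ChildFs {
        final Set<String> files = new TreeSet<>();
        private final Set<String> deleteOnExit = new TreeSet<>();
        private boolean closed;

        void deleteOnExit(String path) {
            deleteOnExit.add(path);
        }

        void delete(String path) {
            if (closed) {
                throw new IllegalStateException("Filesystem closed");
            }
            files.remove(path);
        }

        /** Processes the pending deletes while still open, then marks itself closed. */
        void close() {
            for (String path : deleteOnExit) {
                delete(path);
            }
            deleteOnExit.clear();
            closed = true;
        }
    }

    /** Hypothetical view: maps path prefixes to children and holds no pending paths. */
    static class ViewFs {
        private final Map<String, ChildFs> mounts = new LinkedHashMap<>();

        void mount(String prefix, ChildFs child) {
            mounts.put(prefix, child);
        }

        ChildFs resolve(String path) {
            for (Map.Entry<String, ChildFs> e : mounts.entrySet()) {
                if (path.startsWith(e.getKey())) {
                    return e.getValue();
                }
            }
            throw new IllegalArgumentException("No mount point for " + path);
        }

        /** Forwards the request instead of remembering the path in the view. */
        void deleteOnExit(String path) {
            resolve(path).deleteOnExit(path);
        }

        void close() {
            // Nothing to delete here: every pending path lives in a child.
        }
    }

    /** Closes the child before the view (the problematic order) and reports the result. */
    static boolean tempFileSurvives() {
        ChildFs child = new ChildFs();
        child.files.add("/tmp/job-1");
        ViewFs view = new ViewFs();
        view.mount("/tmp", child);
        view.deleteOnExit("/tmp/job-1");
        child.close(); // child closes first, as can happen at JVM shutdown
        view.close();
        return child.files.contains("/tmp/job-1");
    }

    public static void main(String[] args) {
        System.out.println("temp file survives: " + tempFileSurvives()); // false
    }
}
```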



