[ 
https://issues.apache.org/jira/browse/HDFS-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Podgursky updated HDFS-10323:
---------------------------------
    Description: 
After switching to using a ViewFileSystem, fs.deleteOnExit calls began failing 
frequently, displaying this error on failure:

16/04/21 13:56:24 INFO fs.FileSystem: Ignoring failure to deleteOnExit for path 
/tmp/delete_on_exit_test_123/a438afc0-a3ca-44f1-9eb5-010ca4a62d84

Since FileSystem swallows the underlying error, it is difficult to be sure of 
the cause, but I believe the ViewFileSystem’s child FileSystems are being 
close()’d before the ViewFileSystem itself, because ClientFinalizer closes 
FileSystems in arbitrary order; when the ViewFileSystem then close()s, it tries 
to forward the delete() calls to the appropriate child, and fails because the 
child is already closed.
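
The suspected ordering problem can be illustrated with a toy model. This is 
only a sketch under stated assumptions: ToyChildFs and ToyViewFs are 
hypothetical stand-ins, not Hadoop classes, and the exception type and path 
are invented for illustration. It shows a view that keeps its own 
delete-on-exit set and forwards the deletes to a child when it is closed.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CloseOrderSketch {
    /** Hypothetical stand-in for a child filesystem; not a Hadoop class. */
    static class ToyChildFs {
        final Set<String> paths = new LinkedHashSet<>();
        boolean closed = false;

        boolean delete(String path) {
            if (closed) {
                // Assumption: a closed child rejects further calls.
                throw new IllegalStateException("Filesystem closed");
            }
            return paths.remove(path);
        }

        void close() {
            closed = true;
        }
    }

    /** Hypothetical stand-in for a view that holds the delete-on-exit set itself. */
    static class ToyViewFs {
        final ToyChildFs child;
        final Set<String> deleteOnExit = new LinkedHashSet<>();
        final List<String> ignoredFailures = new ArrayList<>();

        ToyViewFs(ToyChildFs child) {
            this.child = child;
        }

        void deleteOnExit(String path) {
            deleteOnExit.add(path);
        }

        void close() {
            for (String path : deleteOnExit) {
                try {
                    child.delete(path); // forwarded to the child at close time
                } catch (IllegalStateException e) {
                    // The failure is recorded and otherwise ignored.
                    ignoredFailures.add(path);
                }
            }
            deleteOnExit.clear();
        }
    }

    /** Returns true if the path is still present after both closes. */
    static boolean pathSurvives(boolean closeChildFirst) {
        ToyChildFs child = new ToyChildFs();
        child.paths.add("/tmp/example");
        ToyViewFs view = new ToyViewFs(child);
        view.deleteOnExit("/tmp/example");
        if (closeChildFirst) {
            child.close();
            view.close();
        } else {
            view.close();
            child.close();
        }
        return child.paths.contains("/tmp/example");
    }

    public static void main(String[] args) {
        System.out.println("child closed first, path survives: " + pathSurvives(true));
        System.out.println("view closed first, path survives: " + pathSurvives(false));
    }
}
```

In this model the path is left behind only when the child is closed before 
the view, which is the behavior the hypothesis above predicts.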

I’m unsure how to write an actual Hadoop test to reproduce this, since it 
involves testing behavior on actual JVM shutdown.  However, I can verify that 
while

{code:java}
fs.deleteOnExit(randomTemporaryDir);
{code}

regularly (~50% of the time) fails to delete the temporary directory, this code:

{code:java}
ViewFileSystem viewfs = (ViewFileSystem)fs1;
for (FileSystem fileSystem : viewfs.getChildFileSystems()) {
  if (fileSystem.exists(randomTemporaryDir)) {
    fileSystem.deleteOnExit(randomTemporaryDir);
  }
}
{code}

always successfully deletes the temporary directory on JVM shutdown.

I am not very familiar with FileSystem inheritance hierarchies, but at first 
glance I see two ways to fix this behavior:

1)  ViewFileSystem could forward deleteOnExit calls to the appropriate child 
FileSystem, and not hold onto that path itself.

2) FileSystem.Cache.closeAll could first close all ViewFileSystems, then all 
other FileSystems.  
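
Option 2 can be sketched as a stable partition of the cached entries. This is 
a minimal illustration, not the actual FileSystem.Cache code: Entry, its 
isView flag, and the sample names are hypothetical, and a real change would 
need some way to recognize ViewFileSystem instances in the cache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ViewFirstCloseSketch {
    /** Hypothetical cache entry; isView marks a view-style filesystem. */
    static class Entry {
        final String name;
        final boolean isView;

        Entry(String name, boolean isView) {
            this.name = name;
            this.isView = isView;
        }
    }

    /** Returns entry names in the order they would be closed: views first. */
    static List<String> closeOrder(List<Entry> cached) {
        List<Entry> ordered = new ArrayList<>(cached);
        // List.sort is stable, so entries keep their relative order
        // within the "view" group and within the "other" group.
        ordered.sort((a, b) -> Boolean.compare(!a.isView, !b.isView));
        List<String> names = new ArrayList<>();
        for (Entry e : ordered) {
            names.add(e.name);
        }
        return names;
    }

    /** Builds a small sample cache and returns its close order. */
    static List<String> demo() {
        List<Entry> cached = Arrays.asList(
                new Entry("hdfs-a", false),
                new Entry("viewfs", true),
                new Entry("hdfs-b", false));
        return closeOrder(cached);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this ordering a view would always run its pending deletes while its 
children are still open. Option 1 avoids the ordering question entirely, 
since each child would then own the paths it has to delete.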

Would appreciate any thoughts on whether this seems accurate, and thoughts (or 
help) on the fix.


> transient deleteOnExit failure in ViewFileSystem due to close() ordering
> ------------------------------------------------------------------------
>
>                 Key: HDFS-10323
>                 URL: https://issues.apache.org/jira/browse/HDFS-10323
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: federation
>            Reporter: Ben Podgursky
>
> After switching to using a ViewFileSystem, fs.deleteOnExit calls began 
> failing frequently, displaying this error on failure:
> 16/04/21 13:56:24 INFO fs.FileSystem: Ignoring failure to deleteOnExit for 
> path /tmp/delete_on_exit_test_123/a438afc0-a3ca-44f1-9eb5-010ca4a62d84
> Since FileSystem eats the error involved, it is difficult to be sure what the 
> error is, but I believe what is happening is that the ViewFileSystem’s child 
> FileSystems are being close()’d before the ViewFileSystem, due to the random 
> order ClientFinalizer closes FileSystems; so then when the ViewFileSystem 
> tries to close(), it tries to forward the delete() calls to the appropriate 
> child, and fails because the child is already closed.
> I’m unsure how to write an actual Hadoop test to reproduce this, since it 
> involves testing behavior on actual JVM shutdown.  However, I can verify that 
> while
> {code:java}
> fs.deleteOnExit(randomTemporaryDir);
> {code}
> regularly (~50% of the time) fails to delete the temporary directory, this 
> code:
> {code:java}
> ViewFileSystem viewfs = (ViewFileSystem)fs1;
> for (FileSystem fileSystem : viewfs.getChildFileSystems()) {
>   if (fileSystem.exists(randomTemporaryDir)) {
>     fileSystem.deleteOnExit(randomTemporaryDir);
>   }
> }
> {code}
> always successfully deletes the temporary directory on JVM shutdown.
> I am not very familiar with FileSystem inheritance hierarchies, but at first 
> glance I see two ways to fix this behavior:
> 1)  ViewFileSystem could forward deleteOnExit calls to the appropriate child 
> FileSystem, and not hold onto that path itself.
> 2) FileSystem.Cache.closeAll could first close all ViewFileSystems, then all 
> other FileSystems.  
> Would appreciate any thoughts on whether this seems accurate, and thoughts 
> (or help) on the fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
