[ 
https://issues.apache.org/jira/browse/HADOOP-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711783#comment-16711783
 ] 

Adam Antal commented on HADOOP-15819:
-------------------------------------

Hi everyone!

I uploaded patch v1 with the fix.
 Remarks/explanation:
 * After thorough investigation of the cache I came to the conclusion that it 
is work as expected - I did not find any abnormal behaviour. The closed S3AFS 
must got into the cache somehow, so I tried further logging, and it showed that 
it was through the {{FileSystem}}'s following function:
{code:java}
@VisibleForTesting
   static void addFileSystemForTesting(URI uri, Configuration conf,
       FileSystem fs) throws IOException {
       CACHE.map.put(new Cache.Key(uri, conf), fs);
   }
{code}

 * It turned out that among the hadoop-aws integration tests there's only one 
usage of this function, and the stack trace is the following:
{code:java}
CACHE.map.put(new Cache.Key(uri, conf), fs) ->
  FileSystem.addFileSystemForTesting(uri, conf, fs) ->
    FileSystemTestHelper.addFileSystemForTesting(uri, conf, wrapperFS) ->
      AbstractITCommitProtocol.bindFileSystem(FileSystem fs, Path path, 
Configuration conf) ->
        AbstractITCommitProtocol.setup()
{code}

 * (Note that there are other usage in the test suite, but for integration 
tests this is the only one).
 * This function directly injects the FS into the FSCache (which can be in the 
cache already, but with another key), but as I wrote it in my previous comment, 
the caching is explicitly disabled in {{AbstractITCommitProtocol}} since in 
{{AbstractITCommitProtocol.createConfiguration()}} we call 
{{disableFilesystemCaching(conf)}}.
 * So this FS is in the cache, but it is never returned by {{FileSystem.get()}} 
since the cache is disabled. Instead a new S3AFileSystem object is returned in 
each call.
 * During teardown the FS is closed, and it's got removed from the cache *but 
the only removed key-value pair is the injected one.* If there was other 
key-value pair that has the same FS as value, it is kept in the cache. So after 
this action one key with one closed FS value as pair could remain in the 
FSCache.
 * The tests are running in the JVM so the next test case which tries to access 
a FS with {{s3a}} scheme without disabled cache gets this closed FS, and runs 
into the known error.
 * Since the cache is disabled there's no need to explicitly use that 
{{bindFileSystem}} method. My easy fix is to remove it.
 * This function is rarely used in tests, that may explain why we got this 
error and why no one has ever encountered errors like this. To sum it up, the 
cache is working as intended, but the misuse was the root of the issue. Maybe 
we should consider deleting it since it cause unforeseen effects like this 
(note that this function has only one usage except hadoop-aws).

Please verify my observations and please also check that the fix is working 
indeed.
 I also note that {{ITestS3ACommitterFactory}} is still failing, there was 
another error which got hidden by the "closed fs" error.

Regards,
 Adam

> S3A integration test failures: FileSystem is closed! - without parallel test 
> run
> --------------------------------------------------------------------------------
>
>                 Key: HADOOP-15819
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15819
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.1.1
>            Reporter: Gabor Bota
>            Assignee: Gabor Bota
>            Priority: Critical
>         Attachments: HADOOP-15819.000.patch, S3ACloseEnforcedFileSystem.java, 
> S3ACloseEnforcedFileSystem.java, closed_fs_closers_example_5klines.log.zip
>
>
> Running the integration tests for hadoop-aws {{mvn -Dscale verify}} against 
> Amazon AWS S3 (eu-west-1, us-west-1, with no s3guard) we see a lot of these 
> failures:
> {noformat}
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.408 
> s <<< FAILURE! - in 
> org.apache.hadoop.fs.s3a.commit.staging.integration.ITDirectoryCommitMRJob
> [ERROR] 
> testMRJob(org.apache.hadoop.fs.s3a.commit.staging.integration.ITDirectoryCommitMRJob)
>   Time elapsed: 0.027 s  <<< ERROR!
> java.io.IOException: s3a://cloudera-dev-gabor-ireland: FileSystem is closed!
> [ERROR] Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 4.345 
> s <<< FAILURE! - in 
> org.apache.hadoop.fs.s3a.commit.staging.integration.ITStagingCommitMRJob
> [ERROR] 
> testStagingDirectory(org.apache.hadoop.fs.s3a.commit.staging.integration.ITStagingCommitMRJob)
>   Time elapsed: 0.021 s  <<< ERROR!
> java.io.IOException: s3a://cloudera-dev-gabor-ireland: FileSystem is closed!
> [ERROR] 
> testMRJob(org.apache.hadoop.fs.s3a.commit.staging.integration.ITStagingCommitMRJob)
>   Time elapsed: 0.022 s  <<< ERROR!
> java.io.IOException: s3a://cloudera-dev-gabor-ireland: FileSystem is closed!
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.489 
> s <<< FAILURE! - in 
> org.apache.hadoop.fs.s3a.commit.staging.integration.ITStagingCommitMRJobBadDest
> [ERROR] 
> testMRJob(org.apache.hadoop.fs.s3a.commit.staging.integration.ITStagingCommitMRJobBadDest)
>   Time elapsed: 0.023 s  <<< ERROR!
> java.io.IOException: s3a://cloudera-dev-gabor-ireland: FileSystem is closed!
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.695 
> s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.magic.ITMagicCommitMRJob
> [ERROR] testMRJob(org.apache.hadoop.fs.s3a.commit.magic.ITMagicCommitMRJob)  
> Time elapsed: 0.039 s  <<< ERROR!
> java.io.IOException: s3a://cloudera-dev-gabor-ireland: FileSystem is closed!
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.015 
> s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.ITestS3ACommitterFactory
> [ERROR] 
> testEverything(org.apache.hadoop.fs.s3a.commit.ITestS3ACommitterFactory)  
> Time elapsed: 0.014 s  <<< ERROR!
> java.io.IOException: s3a://cloudera-dev-gabor-ireland: FileSystem is closed!
> {noformat}
> The big issue is that the tests are running in a serial manner - no test is 
> running on top of the other - so we should not see that the tests are failing 
> like this. The issue could be in how we handle 
> org.apache.hadoop.fs.FileSystem#CACHE - the tests should use the same 
> S3AFileSystem so if A test uses a FileSystem and closes it in teardown then B 
> test will get the same FileSystem object from the cache and try to use it, 
> but it is closed.
> We see this a lot in our downstream testing too. It's not possible to tell 
> that the failed regression test result is an implementation issue in the 
> runtime code or a test implementation problem. 
> I've checked when and what closes the S3AFileSystem with a sightly modified 
> version of S3AFileSystem which logs the closers of the fs if an error should 
> occur. I'll attach this modified java file for reference. See the next 
> example of the result when it's running:
> {noformat}
> 2018-10-04 00:52:25,596 [Thread-4201] ERROR s3a.S3ACloseEnforcedFileSystem 
> (S3ACloseEnforcedFileSystem.java:checkIfClosed(74)) - Use after close(): 
> java.lang.RuntimeException: Using closed FS!.
>       at 
> org.apache.hadoop.fs.s3a.S3ACloseEnforcedFileSystem.checkIfClosed(S3ACloseEnforcedFileSystem.java:73)
>       at 
> org.apache.hadoop.fs.s3a.S3ACloseEnforcedFileSystem.mkdirs(S3ACloseEnforcedFileSystem.java:474)
>       at 
> org.apache.hadoop.fs.contract.AbstractFSContractTestBase.mkdirs(AbstractFSContractTestBase.java:338)
>       at 
> org.apache.hadoop.fs.contract.AbstractFSContractTestBase.setup(AbstractFSContractTestBase.java:193)
>       at 
> org.apache.hadoop.fs.s3a.ITestS3AClosedFS.setup(ITestS3AClosedFS.java:40)
>       at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>       at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> 2018-10-04 00:52:25,596 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem 
> (S3ACloseEnforcedFileSystem.java:logClosers(61)) - FS was closed 6 times by:
> 2018-10-04 00:52:25,596 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem 
> (S3ACloseEnforcedFileSystem.java:logClosers(64)) - ----------------- close() 
> #1 ---------------
> 2018-10-04 00:52:25,596 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem 
> (S3ACloseEnforcedFileSystem.java:logClosers(65)) - 
> java.lang.RuntimeException: close() called here
>       at 
> org.apache.hadoop.fs.s3a.S3ACloseEnforcedFileSystem.close(S3ACloseEnforcedFileSystem.java:83)
>       at org.apache.hadoop.io.IOUtils.cleanupWithLogger(IOUtils.java:280)
>       at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:298)
>       at 
> org.apache.hadoop.fs.s3a.AbstractS3ATestBase.teardown(AbstractS3ATestBase.java:55)
>       at 
> org.apache.hadoop.fs.s3a.commit.AbstractCommitITest.teardown(AbstractCommitITest.java:206)
>       at 
> org.apache.hadoop.fs.s3a.auth.ITestAssumedRoleCommitOperations.teardown(ITestAssumedRoleCommitOperations.java:90)
>       at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
>       at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> 2018-10-04 00:52:25,596 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem 
> (S3ACloseEnforcedFileSystem.java:logClosers(64)) - ----------------- close() 
> #2 ---------------
> 2018-10-04 00:52:25,596 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem 
> (S3ACloseEnforcedFileSystem.java:logClosers(65)) - 
> java.lang.RuntimeException: close() called here
>       at 
> org.apache.hadoop.fs.s3a.S3ACloseEnforcedFileSystem.close(S3ACloseEnforcedFileSystem.java:83)
>       at org.apache.hadoop.io.IOUtils.cleanupWithLogger(IOUtils.java:280)
>       at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:298)
>       at 
> org.apache.hadoop.fs.s3a.AbstractS3ATestBase.teardown(AbstractS3ATestBase.java:55)
>       at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
>       at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> 2018-10-04 00:52:25,597 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem 
> (S3ACloseEnforcedFileSystem.java:logClosers(64)) - ----------------- close() 
> #3 ---------------
> 2018-10-04 00:52:25,598 [Thread-4201] WARN  s3a.S3ACloseEnforcedFileSystem 
> (S3ACloseEnforcedFileSystem.java:logClosers(65)) - 
> java.lang.RuntimeException: close() called here
>       at 
> org.apache.hadoop.fs.s3a.S3ACloseEnforcedFileSystem.close(S3ACloseEnforcedFileSystem.java:83)
>       at org.apache.hadoop.io.IOUtils.cleanupWithLogger(IOUtils.java:280)
>       at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:298)
>       at 
> org.apache.hadoop.fs.s3a.AbstractS3ATestBase.teardown(AbstractS3ATestBase.java:55)
>       at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
>       at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> (...)
> {noformat}
> From the logs it seems that the closers are the teardown methods of the 
> abstract test classes, so the same FileSystem object gets reused between 
> tests.
> To run a test with the modified fs the following should be in the config:
> {noformat}
> <property>
>   <name>fs.s3a.impl</name>
>   <value>org.apache.hadoop.fs.s3a.S3ACloseEnforcedFileSystem</value>
>   <description>The implementation class of the S3A Filesystem</description>
> </property>
> {noformat}
> I file this jira to solve this issue, and start a conversation if needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to