[ 
https://issues.apache.org/jira/browse/HADOOP-18091?focusedWorklogId=715226&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-715226
 ]

ASF GitHub Bot logged work on HADOOP-18091:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 25/Jan/22 19:18
            Start Date: 25/Jan/22 19:18
    Worklog Time Spent: 10m 
      Work Description: steveloughran opened a new pull request #3930:
URL: https://github.com/apache/hadoop/pull/3930


   Adds a new map type, WeakReferenceMap, which stores weak references to its values,
   and a subclass, WeakReferenceThreadMap, which maps thread ID to value and so
   more closely resembles a thread local.
   
   Construct it with a factory method and an optional callback
   for notification on loss and regeneration.
   
    WeakReferenceThreadMap<WrappingAuditSpan> activeSpan =
         new WeakReferenceThreadMap<>(
             (k) -> getUnbondedSpan(),
             this::noteSpanReferenceLost);
   
   This is used in ActiveAuditManagerS3A for span tracking.
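
   To illustrate the idea, here is a minimal sketch of a map of weak references keyed by thread ID. It is not the code in this PR: the class names `SketchWeakReferenceMap` and `SketchWeakReferenceThreadMap` and the methods `getForCurrentThread()` / `setForCurrentThread()` are hypothetical, and the get-or-create path is deliberately not atomic.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical sketch of a map which holds only weak references to its values. */
class SketchWeakReferenceMap<K, V> {
  private final Map<K, WeakReference<V>> map = new ConcurrentHashMap<>();
  private final Function<? super K, ? extends V> factory;
  private final Consumer<? super K> referenceLost;

  SketchWeakReferenceMap(Function<? super K, ? extends V> factory,
      Consumer<? super K> referenceLost) {
    this.factory = factory;
    this.referenceLost = referenceLost;
  }

  /** Store a value; the map itself keeps only a weak reference to it. */
  void put(K key, V value) {
    map.put(key, new WeakReference<>(value));
  }

  /**
   * Return the value for a key, creating it through the factory if there is
   * no entry or the referent has been garbage collected.
   * Not atomic: two racing callers may each create a value.
   */
  V get(K key) {
    WeakReference<V> ref = map.get(key);
    V value = ref == null ? null : ref.get();
    if (value != null) {
      return value;
    }
    if (ref != null && referenceLost != null) {
      // An entry existed but its referent was collected: notify the callback.
      referenceLost.accept(key);
    }
    V created = factory.apply(key);
    map.put(key, new WeakReference<>(created));
    return created;
  }
}

/** Hypothetical thread-keyed variant: one entry per thread ID. */
class SketchWeakReferenceThreadMap<V> extends SketchWeakReferenceMap<Long, V> {
  SketchWeakReferenceThreadMap(Function<? super Long, ? extends V> factory,
      Consumer<? super Long> referenceLost) {
    super(factory, referenceLost);
  }

  V getForCurrentThread() {
    return get(Thread.currentThread().getId());
  }

  void setForCurrentThread(V value) {
    put(Thread.currentThread().getId(), value);
  }
}
```

   Unlike a ThreadLocal, nothing here keeps a value alive on behalf of an idle pool thread: once the caller drops its own reference, the value is eligible for collection and the next `get()` regenerates it.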
   
   If a calling method holds a reference to the span then the reference stays
   valid even across GC; problems arise only when no more references are held.
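
   The guarantee relied on here is the standard one for `java.lang.ref.WeakReference`: a referent is never cleared while it is still strongly reachable. A small sketch, with `StrongReferenceDemo` and a plain `Object` as hypothetical stand-ins for the caller and its span:

```java
import java.lang.ref.WeakReference;

class StrongReferenceDemo {
  /**
   * Returns true if a weakly referenced object is still available after a
   * GC request while the caller holds a strong reference to it.
   */
  static boolean survivesGcWhileStronglyHeld() {
    Object span = new Object();  // stand-in for an audit span held by the caller
    WeakReference<Object> ref = new WeakReference<>(span);
    System.gc();  // only a hint to the JVM, but harmless here
    // `span` is used below, so it is strongly reachable throughout the GC
    // request and the weak reference cannot have been cleared.
    return ref.get() == span;
  }
}
```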
   
   This does mean that if S3A code keeps references around, then back
   references to the auditor are retained.
   
   But those classes which can get returned and which do have spans (list
   iterators, input and output streams, ...) all have callbacks into the main
   S3A file system anyway.
   
   Testing this has been fun; about as hard as the production code.
   
   The good news: we can do this in a unit test relatively quickly.
   We just create a sequence of audit managers and, in each one,
   schedule tasks across a thread pool to create spans.
   By providing an auditor implementation whose class and spans use lots of
   memory, we can trigger OOM fast on the original code.
   
   With the new structure this doesn't happen. Instead, after 100+
   iterations, GC calls trigger removal of the only-weakly-referenced spans.
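
   The shape of such a test can be sketched as below. This is not the test added in this PR: `WeakSpanLoadSketch`, `HeavySpan` and the `run()` parameters are hypothetical, and a per-round map stands in for one audit manager. Because only weak references are stored, each payload becomes collectible as soon as its task returns, so raising the round count and payload size should not exhaust the heap.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class WeakSpanLoadSketch {
  /** Hypothetical stand-in for a memory-hungry audit span. */
  static final class HeavySpan {
    final byte[] payload;

    HeavySpan(int bytes) {
      payload = new byte[bytes];
    }
  }

  /**
   * Run a number of rounds, each with its own map (standing in for one audit
   * manager), scheduling `threads` span-creating tasks per round on a shared pool.
   * Returns the number of tasks which completed.
   */
  static int run(int managers, int threads, int spanBytes) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    int completed = 0;
    try {
      for (int i = 0; i < managers; i++) {
        Map<Long, WeakReference<HeavySpan>> spans = new ConcurrentHashMap<>();
        List<Future<?>> futures = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
          futures.add(pool.submit(() -> {
            // Only a weak reference is stored, so the span is collectible
            // once this task returns.
            spans.put(Thread.currentThread().getId(),
                new WeakReference<>(new HeavySpan(spanBytes)));
          }));
        }
        for (Future<?> f : futures) {
          f.get();  // propagate any task failure, including OOM
          completed++;
        }
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdownNow();
    }
    return completed;
  }
}
```

   Storing the spans in a ThreadLocal instead would leave them strongly reachable from the pool threads after each round, which is the leak pattern described in the JIRA.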
   
   I have run the ITest suites against S3 London and all is well; retesting.
   
   My system is set to fail if any operation is ever executed outside a span,
   other than those which happen during copy operations when the S3 SDK transfer
   manager invokes them.
   
   That is the sole risk I can see in this world: that if an external
   reference is not held then the thread reference will be discarded prematurely.
   
   I don't see it happening in this case but will review carefully just to make 
sure.
   
   
   ### How was this patch tested?
   
   New test which triggers OOM on the old code, but works now.
   
   ### For code changes:
   
   - [X] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [X] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 715226)
    Time Spent: 3h 50m  (was: 3h 40m)

> S3A auditing leaks memory through ThreadLocal references
> --------------------------------------------------------
>
>                 Key: HADOOP-18091
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18091
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.3.2
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> {{ActiveAuditManagerS3A}} uses thread locals to map to active audit spans, 
> which (because they are wrapped) include a back reference to the audit manager 
> instance and the config it was created with.
> These *do not* get cleaned up when the FS instance is closed.
> If you have a long-lived process creating and destroying many FS instances, 
> then memory gets used up.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
