[ 
https://issues.apache.org/jira/browse/HADOOP-16278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835783#comment-16835783
 ] 

Aaron Fabbri edited comment on HADOOP-16278 at 5/8/19 5:47 PM:
---------------------------------------------------------------

Agreed, +1 this simple patch stopping the quantiles on FS close. Also wanted to 
say nice work on this Jira [~prongs].


was (Author: fabbri):
Agreed, +1 this simple patch stopping the quantiles on FS close.

> With S3A Filesystem, Long Running services End up Doing lot of GC and 
> eventually die
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16278
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common, hadoop-aws, metrics
>    Affects Versions: 3.1.0, 3.1.1, 3.1.2
>            Reporter: Rajat Khandelwal
>            Priority: Major
>             Fix For: 3.1.3
>
>         Attachments: HADOOP-16278.patch, Screenshot 2019-04-30 at 12.52.42 
> PM.png, Screenshot 2019-04-30 at 2.33.59 PM.png
>
>
> I'll start with the symptoms and eventually come to the cause. 
>  
> We are using HDP 3.1 and noticed that every couple of days the Hive Metastore 
> starts doing GC, sometimes with 30-minute-long pauses, yet nothing is 
> collected and the heap remains fully used. 
>  
> Next, we looked at the heap dump and found that 99% of the memory is taken up 
> by the task queue of a single ExecutorService. 
>  
> !Screenshot 2019-04-30 at 12.52.42 PM.png!
> The instance is created like this:
> {{private static final ScheduledExecutorService scheduler = Executors}}
> {{    .newScheduledThreadPool(1, new ThreadFactoryBuilder().setDaemon(true)}}
> {{    .setNameFormat("MutableQuantiles-%d").build());}}
>  
> So all instances of MutableQuantiles share a single-threaded 
> ExecutorService.
> The second thing to notice is this block of code in the constructor of 
> MutableQuantiles:
> {{this.scheduledTask = scheduler.scheduleAtFixedRate(new 
> MutableQuantiles.RolloverSample(this), (long)interval, (long)interval, 
> TimeUnit.SECONDS);}}
> So as soon as a MutableQuantiles instance is created, one task is scheduled 
> at a fixed rate. Instead, it could be scheduled with a fixed delay (refer to 
> HADOOP-16248). 
> Now, coming to why this is related to S3. 
>  
> S3AFileSystem creates an instance of S3AInstrumentation, which creates two 
> quantiles (related to S3Guard) with a hardcoded 1s interval and leaves them 
> hanging, i.e. perpetually scheduled. As new instances of S3AFileSystem are 
> created, two new quantiles are created each time, which in turn create two 
> scheduled tasks and never cancel them. This way the number of scheduled tasks 
> keeps growing without ever being cleaned up, leading to GC pressure, OOM, and 
> eventually a crash. 
>  
> MutableQuantiles has a numInfo field which holds details such as the name of 
> the metric. From the heap dump, I found one numInfo and traced all objects 
> referencing it.
>  
> !Screenshot 2019-04-30 at 2.33.59 PM.png!
>  
> There seem to be 300K objects for the same metric 
> (S3Guard_metadatastore_throttle_rate). 
> As expected, there are another 300K objects for the other MutableQuantiles 
> created by the S3AInstrumentation class, although there are only 4 instances 
> of S3AInstrumentation. 
> Clearly, there is a leak. Each S3AInstrumentation instance creates two 
> scheduled tasks that run every second. These tasks are left scheduled and not 
> cancelled when S3AInstrumentation.close() is called, so they are never 
> cleaned up, and GC cannot collect them since they are still referenced by the 
> scheduler. 
> Who creates S3AInstrumentation instances? S3AFileSystem.initialize(), which 
> is called in FileSystem.get(URI, Configuration). Since the Hive Metastore is 
> a service that deals with a lot of Path objects and hence makes many calls to 
> FileSystem.get, it is the first to show these symptoms. 
> We're seeing similar symptoms in AMs for long-running jobs (both Tez AM and 
> MR AM). 
>  
>  
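The fixed-rate vs. fixed-delay distinction mentioned above (HADOOP-16248) can be demonstrated with a small, hypothetical Java program (not Hadoop code; the class and variable names are illustrative): scheduleAtFixedRate targets a fixed period regardless of how long each run takes, while scheduleWithFixedDelay waits the full delay after each run finishes, so a slow task fires roughly half as often under fixed delay.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative demo: a task whose runtime (~100 ms) equals the scheduling
// period (100 ms) runs back to back under scheduleAtFixedRate, but only
// every ~200 ms under scheduleWithFixedDelay (100 ms run + 100 ms delay).
class FixedRateVsFixedDelay {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService ses = Executors.newScheduledThreadPool(2);
        AtomicInteger rateRuns = new AtomicInteger();
        AtomicInteger delayRuns = new AtomicInteger();

        ses.scheduleAtFixedRate(
            () -> { rateRuns.incrementAndGet(); sleep(100); },
            0, 100, TimeUnit.MILLISECONDS);
        ses.scheduleWithFixedDelay(
            () -> { delayRuns.incrementAndGet(); sleep(100); },
            0, 100, TimeUnit.MILLISECONDS);

        Thread.sleep(1000);  // observe both schedules for ~1 second
        ses.shutdownNow();

        // Fixed rate fires noticeably more often than fixed delay here.
        System.out.println(rateRuns.get() > delayRuns.get());
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Fixed delay would at least prevent catch-up executions from piling up when the shared scheduler thread falls behind, though it would not by itself fix the leak described in this issue.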
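The leak mechanism and the fix being +1'd (cancelling the quantiles' tasks on close) can be sketched as follows. This is a minimal, hypothetical model of the pattern, not the actual Hadoop source; the class name and stop() method are illustrative:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the leak: every instance schedules a repeating
// task on a shared single-threaded scheduler. Unless that task is
// cancelled when the owner is closed, the scheduler's queue holds a
// reference to it forever, so GC can never reclaim it.
class QuantilesLeakSketch {
    // One shared daemon scheduler for all instances, as in MutableQuantiles.
    private static final ScheduledExecutorService SCHEDULER =
        Executors.newScheduledThreadPool(1, r -> {
            Thread t = new Thread(r, "MutableQuantiles-sketch");
            t.setDaemon(true);
            return t;
        });

    private final ScheduledFuture<?> scheduledTask;

    QuantilesLeakSketch(long intervalSeconds) {
        // Each new instance adds one perpetually repeating task.
        scheduledTask = SCHEDULER.scheduleAtFixedRate(
            () -> { /* roll over the sample window */ },
            intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    // The fix: cancel the repeating task when the owner is closed, so the
    // scheduler drops its reference and the instance becomes collectable.
    void stop() {
        scheduledTask.cancel(false);
    }

    public static void main(String[] args) {
        // Without stop(), each iteration would leave a task scheduled forever,
        // which is exactly how the queue grows unboundedly in this issue.
        for (int i = 0; i < 1000; i++) {
            QuantilesLeakSketch q = new QuantilesLeakSketch(1);
            q.stop();
        }
        System.out.println("all tasks cancelled");
    }
}
```

With the patch, the FileSystem's close path stops the quantiles the same way, so closing an S3AFileSystem no longer leaves two tasks per instance queued forever.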



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
