[ https://issues.apache.org/jira/browse/HADOOP-16278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835783#comment-16835783 ]
Aaron Fabbri edited comment on HADOOP-16278 at 5/8/19 5:47 PM:
---------------------------------------------------------------

Agreed, +1 this simple patch stopping the quantiles on FS close.

Also wanted to say nice work on this Jira [~prongs].


was (Author: fabbri):
    Agreed, +1 this simple patch stopping the quantiles on FS close.

> With S3A Filesystem, Long Running services End up Doing lot of GC and eventually die
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16278
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common, hadoop-aws, metrics
>    Affects Versions: 3.1.0, 3.1.1, 3.1.2
>            Reporter: Rajat Khandelwal
>            Priority: Major
>             Fix For: 3.1.3
>
>         Attachments: HADOOP-16278.patch, Screenshot 2019-04-30 at 12.52.42 PM.png, Screenshot 2019-04-30 at 2.33.59 PM.png
>
>
> I'll start with the symptoms and eventually come to the cause.
>
> We are using HDP 3.1 and noticed that every couple of days the Hive Metastore
> starts doing GC, sometimes with 30-minute-long pauses, although nothing is
> collected and the heap remains fully used.
>
> Next, we looked at the heap dump and found that 99% of the memory is taken up
> by the task queue of one ExecutorService.
>
> !Screenshot 2019-04-30 at 12.52.42 PM.png!
> The instance is created like this:
> {{ private static final ScheduledExecutorService scheduler = Executors}}
> {{ .newScheduledThreadPool(1, new ThreadFactoryBuilder().setDaemon(true)}}
> {{ .setNameFormat("MutableQuantiles-%d").build());}}
>
> So all instances of MutableQuantiles share a single-threaded
> ExecutorService.
> The second thing to notice is this block of code in the constructor of
> MutableQuantiles:
> {{this.scheduledTask = scheduler.scheduleAtFixedRate(new MutableQuantiles.RolloverSample(this), (long)interval, (long)interval, TimeUnit.SECONDS);}}
> So as soon as a MutableQuantiles instance is created, one task is scheduled
> at a fixed rate. Instead, it could schedule them with a fixed delay (refer to
> HADOOP-16248).
> Now, coming to why this is related to S3.
>
> S3AFileSystem creates an instance of S3AInstrumentation, which creates two
> quantiles (related to S3Guard) with a 1s (hardcoded) interval and leaves them
> hanging. By hanging I mean perpetually scheduled. Whenever a new instance
> of S3AFileSystem is created, two new quantiles are created, which in turn
> create two scheduled tasks and never cancel them. This way the number of
> scheduled tasks keeps growing without ever getting cleaned up, leading to
> GC/OOM/crash.
>
> MutableQuantiles has a numInfo field which tells things like the name of the
> metric. From the heap dump, I found one numInfo and traced all objects
> referencing it.
>
> !Screenshot 2019-04-30 at 2.33.59 PM.png!
>
> There seem to be 300K objects for the same metric
> (S3Guard_metadatastore_throttle_rate).
> As expected, there are another 300K objects for the other MutableQuantiles
> created by the S3AInstrumentation class,
> although the number of instances of the S3AInstrumentation class is only 4.
> Clearly, there is a leak. One S3AInstrumentation instance creates two
> scheduled tasks to be run every second. These tasks are left scheduled and
> not cancelled when S3AInstrumentation.close() is called.
> Hence, they are never cleaned up. GC is also not able to collect them since
> they are referenced by the scheduler.
> Who creates S3AInstrumentation instances? S3AFileSystem.initialize(), which
> is called in FileSystem.get(URI, Configuration). Since Hive Metastore is a
> service that deals with a lot of Path objects and hence needs to make a lot of
> calls to FileSystem.get, it's the first one to show these symptoms.
> We're seeing similar symptoms in the AM for long-running jobs (for both Tez AM
> and MR AM).
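
The leak described in the issue can be reproduced in miniature with plain java.util.concurrent. The sketch below is not Hadoop code and not the attached patch: Quantile, Instrumentation, and pendingAfter are hypothetical stand-ins for MutableQuantiles, S3AInstrumentation, and a measurement helper. It assumes a 60-second interval (rather than the 1s in the report) so that no task fires while the queue is being counted, and it enables setRemoveOnCancelPolicy(true) so that cancelled tasks leave the queue immediately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QuantileLeakSketch {
    // Shared single-threaded scheduler, mirroring the static one described in the issue.
    static final ScheduledThreadPoolExecutor SCHEDULER = newScheduler();

    static ScheduledThreadPoolExecutor newScheduler() {
        ScheduledThreadPoolExecutor s = new ScheduledThreadPoolExecutor(1, r -> {
            Thread t = new Thread(r, "QuantileSketch-0");
            t.setDaemon(true); // do not keep the JVM alive
            return t;
        });
        // Assumption of this sketch: drop cancelled tasks from the queue right away.
        s.setRemoveOnCancelPolicy(true);
        return s;
    }

    /** Hypothetical stand-in for a quantile metric that schedules a periodic rollover. */
    static final class Quantile {
        private final ScheduledFuture<?> task;

        Quantile(long intervalSeconds) {
            // The rollover body is a no-op here; only the scheduling matters.
            task = SCHEDULER.scheduleAtFixedRate(
                () -> { }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
        }

        void stop() {
            task.cancel(false);
        }
    }

    /** Hypothetical stand-in for an instrumentation object owning two quantiles. */
    static final class Instrumentation implements AutoCloseable {
        private final List<Quantile> quantiles = new ArrayList<>();
        private final boolean stopOnClose;

        Instrumentation(boolean stopOnClose) {
            this.stopOnClose = stopOnClose;
            quantiles.add(new Quantile(60));
            quantiles.add(new Quantile(60));
        }

        @Override
        public void close() {
            // Leaky variant: close() does nothing, so both tasks stay scheduled forever.
            if (stopOnClose) {
                quantiles.forEach(Quantile::stop);
            }
        }
    }

    /** Creates and closes {@code instances} objects; returns how many tasks were left queued. */
    static int pendingAfter(int instances, boolean stopOnClose) {
        int before = SCHEDULER.getQueue().size();
        for (int i = 0; i < instances; i++) {
            Instrumentation instrumentation = new Instrumentation(stopOnClose);
            instrumentation.close();
        }
        return SCHEDULER.getQueue().size() - before;
    }

    public static void main(String[] args) {
        System.out.println("left queued without stop on close: " + pendingAfter(1000, false));
        System.out.println("left queued with stop on close:    " + pendingAfter(1000, true));
    }
}
```

Without cancellation, every closed instance leaves two tasks in the shared scheduler's queue, and the scheduler keeps them strongly reachable, which matches the growth seen in the heap dump. Cancelling in close() releases them. Without the remove-on-cancel policy, a cancelled task would stay queued until its next scheduled run time, which is short for a 1s interval but still worth knowing about.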