[
https://issues.apache.org/jira/browse/HADOOP-16278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830955#comment-16830955
]
Hadoop QA commented on HADOOP-16278:
------------------------------------
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m
0s{color} | {color:red} The patch doesn't appear to include any new or modified
tests. Please justify why no new tests are needed for this patch. Also please
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m
33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green}
12m 34s{color} | {color:green} branch has no errors when building and testing
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m
25s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m
0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green}
12m 52s{color} | {color:green} patch has no errors when building and testing
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m
22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m
42s{color} | {color:green} hadoop-aws in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m
26s{color} | {color:green} The patch does not generate ASF License warnings.
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 56m 35s{color} |
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | HADOOP-16278 |
| JIRA Patch URL |
https://issues.apache.org/jira/secure/attachment/12967557/HADOOP-16278.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall
mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 5a20da19b0ce 4.4.0-144-generic #170~14.04.1-Ubuntu SMP Mon Mar
18 15:02:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 4877f0a |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| Test Results |
https://builds.apache.org/job/PreCommit-HADOOP-Build/16207/testReport/ |
| Max. process+thread count | 310 (vs. ulimit of 10000) |
| modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
| Console output |
https://builds.apache.org/job/PreCommit-HADOOP-Build/16207/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
> With S3 Filesystem, Long Running services End up Doing lot of GC and
> eventually die
> -----------------------------------------------------------------------------------
>
> Key: HADOOP-16278
> URL: https://issues.apache.org/jira/browse/HADOOP-16278
> Project: Hadoop Common
> Issue Type: Bug
> Components: common, hadoop-aws, metrics
> Affects Versions: 3.1.0, 3.1.1, 3.1.2
> Reporter: Rajat Khandelwal
> Priority: Major
> Fix For: 3.1.3
>
> Attachments: HADOOP-16278.patch, Screenshot 2019-04-30 at 12.52.42
> PM.png, Screenshot 2019-04-30 at 2.33.59 PM.png
>
>
> I'll start with the symptoms and eventually come to the cause.
>
> We are using HDP 3.1 and noticed that every couple of days the Hive Metastore
> starts doing GC, sometimes with 30-minute-long pauses, although nothing is
> collected and the heap remains fully used.
>
> Next, we looked at a heap dump and found that 99% of the memory is taken up
> by the task queue of a single ExecutorService.
>
> !Screenshot 2019-04-30 at 12.52.42 PM.png!
> The instance is created like this:
> {{ private static final ScheduledExecutorService scheduler = Executors}}
> {{ .newScheduledThreadPool(1, new ThreadFactoryBuilder().setDaemon(true)}}
> {{ .setNameFormat("MutableQuantiles-%d").build());}}
>
> So all instances of MutableQuantiles share a single-threaded
> ExecutorService.
> The second thing to notice is this block of code in the constructor of
> MutableQuantiles:
> {{this.scheduledTask = scheduler.scheduleAtFixedRate(new
> MutableQuantiles.RolloverSample(this), (long)interval, (long)interval,
> TimeUnit.SECONDS);}}
> So as soon as a MutableQuantiles instance is created, one task is scheduled
> at a fixed rate. It could instead be scheduled with a fixed delay (see
> HADOOP-16248).
> Now coming to why it's related to S3.
>
> S3AFileSystem creates an instance of S3AInstrumentation, which creates two
> quantiles (related to S3Guard) with a hardcoded 1s interval and leaves them
> hanging, that is, perpetually scheduled. Whenever a new instance of
> S3AFileSystem is created, two new quantiles are created, which in turn
> schedule two tasks and never cancel them. The number of scheduled tasks
> therefore keeps growing without ever being cleaned up, leading to
> GC/OOM/crash.
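
The growth pattern described above can be reproduced in isolation. The following is a minimal sketch, not Hadoop code: `FakeQuantile` and `FakeInstrumentation` are hypothetical stand-ins for MutableQuantiles and S3AInstrumentation, and the interval is lengthened from the reported 1s to one hour only so that the queue size stays stable while it is measured.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class QuantileLeakSketch {
    // Shared single-threaded scheduler, mirroring the static pool quoted in the report.
    static final ScheduledThreadPoolExecutor SCHEDULER =
        new ScheduledThreadPoolExecutor(1, r -> {
            Thread t = new Thread(r, "sketch-quantiles");
            t.setDaemon(true);
            return t;
        });

    // Hypothetical stand-in for MutableQuantiles: schedules a periodic task
    // in its constructor and keeps no handle that would allow cancelling it.
    static final class FakeQuantile {
        FakeQuantile(long intervalSeconds) {
            SCHEDULER.scheduleAtFixedRate(() -> { },
                intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
        }
    }

    // Hypothetical stand-in for S3AInstrumentation: two quantiles per instance.
    static final class FakeInstrumentation {
        final FakeQuantile first = new FakeQuantile(3600);
        final FakeQuantile second = new FakeQuantile(3600);
    }

    // Returns how many tasks were added to the scheduler queue by creating
    // the given number of instrumentation objects.
    static int pendingTasksAfter(int instances) {
        int before = SCHEDULER.getQueue().size();
        for (int i = 0; i < instances; i++) {
            new FakeInstrumentation(); // the instance itself becomes garbage right away
        }
        return SCHEDULER.getQueue().size() - before;
    }

    public static void main(String[] args) {
        System.out.println("tasks still queued: " + pendingTasksAfter(1000));
    }
}
```

Even though every `FakeInstrumentation` is unreachable as soon as it is constructed, its two tasks stay in the scheduler queue, which is the same shape as the 300K-objects-for-4-instances observation in the heap dump.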
>
> MutableQuantiles has a numInfo field that holds things like the name of the
> metric. From the heap dump, I picked one numInfo and traced all objects
> referencing it.
>
> !Screenshot 2019-04-30 at 2.33.59 PM.png!
>
> There seem to be 300K objects for the same metric
> (S3Guard_metadatastore_throttle_rate).
> As expected, there are another 300K objects for the other MutableQuantiles
> created by the S3AInstrumentation class, even though there are only 4
> instances of S3AInstrumentation.
> Clearly, there is a leak. Each S3AInstrumentation instance creates two
> scheduled tasks that run every second. These tasks are left scheduled and
> are not cancelled when S3AInstrumentation.close() is called, so they are
> never cleaned up. GC cannot collect them either, since they are still
> referenced by the scheduler.
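
One way to release such tasks is to keep the `ScheduledFuture` handles and cancel them in `close()`. The sketch below illustrates that idea only; it is not the attached HADOOP-16278.patch, whose contents are not shown in this message. `ClosableInstrumentation` is a hypothetical name, and enabling `setRemoveOnCancelPolicy(true)` on the shared scheduler is an assumption made here so that cancelled tasks leave the queue immediately instead of waiting for their next scheduled run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class CancelOnCloseSketch {
    static final ScheduledThreadPoolExecutor SCHEDULER = newScheduler();

    static ScheduledThreadPoolExecutor newScheduler() {
        ScheduledThreadPoolExecutor s = new ScheduledThreadPoolExecutor(1, r -> {
            Thread t = new Thread(r, "sketch-quantiles");
            t.setDaemon(true);
            return t;
        });
        // Assumption for this sketch: remove cancelled tasks from the queue
        // at cancel time rather than when their delay next expires.
        s.setRemoveOnCancelPolicy(true);
        return s;
    }

    // Hypothetical instrumentation object that remembers what it scheduled.
    static final class ClosableInstrumentation implements AutoCloseable {
        private final List<ScheduledFuture<?>> tasks = new ArrayList<>();

        ClosableInstrumentation() {
            // Two periodic tasks per instance, as in the report. The interval
            // is one hour here only to keep the measurement stable.
            for (int i = 0; i < 2; i++) {
                tasks.add(SCHEDULER.scheduleAtFixedRate(() -> { },
                    3600, 3600, TimeUnit.SECONDS));
            }
        }

        @Override
        public void close() {
            // Cancel every task this instance scheduled so the scheduler
            // no longer holds a reference to it.
            for (ScheduledFuture<?> task : tasks) {
                task.cancel(false);
            }
            tasks.clear();
        }
    }

    // Returns how many tasks remain queued after opening and closing
    // the given number of instrumentation objects.
    static int pendingAfterOpenAndClose(int instances) {
        int before = SCHEDULER.getQueue().size();
        for (int i = 0; i < instances; i++) {
            try (ClosableInstrumentation inst = new ClosableInstrumentation()) {
                // a real caller would record metrics here
            }
        }
        return SCHEDULER.getQueue().size() - before;
    }

    public static void main(String[] args) {
        System.out.println("tasks still queued: " + pendingAfterOpenAndClose(1000));
    }
}
```

Without the remove-on-cancel policy, a cancelled periodic task would still sit in the queue until its next run time, which is short for a 1s interval but would matter for longer ones.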
> Who creates S3AInstrumentation instances? S3AFileSystem.initialize(), which
> is called from FileSystem.get(URI, Configuration). Since the Hive Metastore
> is a service that deals with a lot of Path objects and hence makes a lot of
> calls to FileSystem.get, it is the first to show these symptoms.
> We're seeing similar symptoms in the AM for long-running jobs (both Tez AM
> and MR AM).
>
>