[
https://issues.apache.org/jira/browse/HBASE-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015978#comment-17015978
]
Hudson commented on HBASE-23679:
--------------------------------
Results for branch master
[build #1598 on
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1598/]: (x)
*{color:red}-1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1598//General_Nightly_Build_Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2)
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1598//JDK8_Nightly_Build_Report_(Hadoop2)/]
(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1598//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> FileSystem instance leaks due to bulk loads with Kerberos enabled
> -----------------------------------------------------------------
>
> Key: HBASE-23679
> URL: https://issues.apache.org/jira/browse/HBASE-23679
> Project: HBase
> Issue Type: Bug
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Critical
> Fix For: 3.0.0, 2.3.0, 2.1.9, 2.2.4
>
>
> Spent the better part of a week chasing an issue on HBase 2.x where the
> number of DistributedFileSystem instances on the heap of a RegionServer would
> grow unbounded. Looking at multiple heap-dumps, it was obvious that we had an
> immense number of DFS instances cached (in FileSystem$Cache) for the same
> user; the only thing unique to each instance was the set of Tokens in its
> UGI member (one HBase delegation token and two HDFS delegation tokens; we
> only attach these for bulk loads). On the user's clusters, this eventually
> caused a 10x performance degradation as RegionServers spent all of their time
> in JVM GC (they were unlucky enough that no RegionServer crashed outright,
> which would have, albeit temporarily, fixed the issue).
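> To make the cache behavior concrete, below is a minimal, hypothetical sketch
> (illustrative class name, not HBase code): FileSystem$Cache keys entries on
> the UGI, and UGI equality is based on Subject identity rather than user name,
> so every freshly-built UGI produces a new cached instance.
> {code:java}
> import java.security.PrivilegedExceptionAction;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.security.UserGroupInformation;
>
> public class FsCacheLeakSketch {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     for (int i = 0; i < 3; i++) {
>       // Each bulk load builds a fresh UGI to carry its delegation tokens.
>       UserGroupInformation ugi = UserGroupInformation.createRemoteUser("hbase");
>       FileSystem fs = ugi.doAs(
>           (PrivilegedExceptionAction<FileSystem>) () -> FileSystem.get(conf));
>       // Same user name every iteration, yet a distinct cached instance each
>       // time, because UGI compares the underlying Subject by identity.
>       System.out.println(System.identityHashCode(fs));
>     }
>   }
> }
> {code}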
> The problem seems to be two-fold, with the changes from HBASE-15291 being
> largely the cause. That issue tried to close FileSystem instances which were
> being leaked; however, it did this by instrumenting the method
> {{SecureBulkLoadManager.cleanupBulkLoad(..)}}. There are two big issues with
> this approach:
> # It relies on clients to call this method (a client that hangs up will leak
> resources in RegionServers).
> # The method is only called on the RegionServer hosting the first Region of
> the table that was bulk-loaded into; when a bulk load spans multiple
> RegionServers, the others are left to leak resources.
> HBASE-21342 later fixed an issue where FS objects were being closed
> prematurely, by introducing reference-counting (which appears to work fine),
> but it does not address the other two issues above. Point #2 makes debugging
> this problem harder than normal because it doesn't manifest on a single-node
> instance :)
> Through all of this, I (re)learned the dirty history of UGI and how its
> caching doesn't work so well (HADOOP-6670). I see continuing to lean on the
> FileSystem$CACHE as a potentially dangerous thing (we've been down this road
> multiple times already). My opinion at this point is that we should cleanly
> create a new FileSystem instance during the call to
> {{SecureBulkLoadManager#secureBulkLoadHFiles(..)}} and close it in a finally
> block in that same method; a rough sketch follows below. This both simplifies
> the lifecycle of a FileSystem instance in the bulk-load codepath and helps us
> avoid future problems with UGI and FS caching. The one downside is that we
> pay the cost of creating a new FileSystem instance on every bulk load, but
> I'm of the opinion that we can cross that bridge when we get there.
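> A minimal sketch of that shape (hypothetical and simplified, not a committed
> patch; the signature and staging-directory parameter are illustrative): the
> bulk load owns an uncached instance for exactly the duration of one call.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class SecureBulkLoadSketch {
>   // Simplified stand-in for SecureBulkLoadManager#secureBulkLoadHFiles(..).
>   public void secureBulkLoadHFiles(Configuration conf, Path stagingDir)
>       throws Exception {
>     // newInstance() bypasses FileSystem$Cache, so this instance belongs to
>     // this call alone and never piles up in the process-wide cache.
>     FileSystem fs = FileSystem.newInstance(stagingDir.toUri(), conf);
>     try {
>       // ... move the staged HFiles into the region directories using fs ...
>     } finally {
>       // Closing here bounds the lifetime to this one call, regardless of
>       // whether the client ever calls cleanupBulkLoad().
>       fs.close();
>     }
>   }
> }
> {code}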
> Thanks to [~jdcryans] and [~busbey] for their help along the way.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)