[jira] [Commented] (HBASE-23679) FileSystem instance leaks due to bulk loads with Kerberos enabled

Hudson (Jira) Tue, 14 Jan 2020 16:00:11 -0800


    [ https://issues.apache.org/jira/browse/HBASE-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015499#comment-17015499 ]


Hudson commented on HBASE-23679:
--------------------------------

Results for branch branch-2.2
        [build #754 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/754/]: 
(x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/754//General_Nightly_Build_Report/]


(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/754//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/754//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(x) {color:red}-1 client integration test{color}
-- Failed when running client tests on top of Hadoop 2. [see log for details|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/754//artifact/output-integration/hadoop-2.log]. (note that this means we didn't run on Hadoop 3)


> FileSystem instance leaks due to bulk loads with Kerberos enabled
> -----------------------------------------------------------------
>
>                 Key: HBASE-23679
>                 URL: https://issues.apache.org/jira/browse/HBASE-23679
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 3.0.0, 2.3.0, 2.1.9, 2.2.4
>
>
> Spent the better part of a week chasing an issue on HBase 2.x where the 
> number of DistributedFileSystem instances on the heap of a RegionServer would 
> grow unbounded. Looking at multiple heap dumps, it was obvious that we had an 
> immense number of DFS instances cached (in FileSystem$Cache) for the same 
> user, each with a unique set of Tokens contained in that DFS's UGI member 
> (one hbase delegation token, and two HDFS delegation tokens – we only do this 
> for bulk loads). The user's clusters eventually experienced a 10x perf 
> degradation as RegionServers spent all of their time in JVM GC (they were 
> unlucky not to have RegionServers crash outright, as this would've, albeit 
> temporarily, fixed the issue).
> The problem seems to be two-fold, with the changes from HBASE-15291 being 
> largely the cause. That issue tried to close FileSystem instances which were 
> being leaked – however, it did so by instrumenting the method 
> {{SecureBulkLoadManager.cleanupBulkLoad(..)}}. Two big issues with this 
> approach:
>  # It relies on clients to call this method (clients hanging up will leak 
> resources in RegionServers)
>  # This method is only called on the RegionServer hosting the first Region of 
> the table which was bulk-loaded into. Any other RegionServers involved in the 
> bulk load are left to leak resources.
> HBASE-21342 later used reference-counting to fix an issue where FS objects 
> were now being closed prematurely (which appears to work fine), but it does 
> not address the two issues above. Point #2 makes debugging this issue harder 
> than normal because it doesn't manifest on a single-node instance :)
> Through all of this, I (re)learned the dirty history of UGI and how its 
> caching doesn't work so great (HADOOP-6670). I see trying to continue to 
> leverage the FileSystem$CACHE as a potentially dangerous thing (we've been 
> back here multiple times already). My opinion at this point is that we should 
> cleanly create a new FileSystem instance during the call to 
> {{SecureBulkLoadManager#secureBulkLoadHFiles(..)}} and close it in a finally 
> block in that same method. This both simplifies the lifecycle of a FileSystem 
> instance in the bulk-load codepath and helps us avoid future problems with 
> UGI and FS caching. The one downside is that we pay the penalty to create a 
> new FileSystem instance, but I'm of the opinion that we cross that bridge 
> when we get there.
> Thanks to [~jdcryans] and [~busbey] for their help along the way.
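
The lifecycle proposed in the description can be sketched in plain Java. This is a toy model, not HBase or Hadoop code: Identity, Fs, CACHE and both bulkLoad* methods are hypothetical stand-ins. It assumes that every bulk-load request arrives with a fresh identity object and that the cache compares those objects by reference, which is roughly how the description characterises the UGI-keyed FileSystem cache.

```java
import java.util.HashMap;
import java.util.Map;

public class BulkLoadFsLifecycle {

    /** Hypothetical stand-in for a per-request user identity (models a UGI compared by reference). */
    static final class Identity {
        final String user;
        Identity(String user) { this.user = user; }
        // equals/hashCode are deliberately not overridden: two Identity objects
        // for the same user are different cache keys.
    }

    /** Hypothetical stand-in for a filesystem handle; not Hadoop's FileSystem. */
    static final class Fs implements AutoCloseable {
        static int open = 0;          // handles created and not yet closed
        Fs() { open++; }
        void moveFiles() { /* placeholder for the actual HFile moves */ }
        @Override public void close() { open--; }
    }

    /** Toy cache keyed by identity, standing in for the shared FileSystem cache. */
    static final Map<Identity, Fs> CACHE = new HashMap<>();

    /** Cached lookup: every request brings a new Identity, so every request adds an entry. */
    static void bulkLoadWithCache(String user) {
        Identity id = new Identity(user);
        Fs fs = CACHE.computeIfAbsent(id, k -> new Fs());
        fs.moveFiles();
        // Nothing on this path removes the entry or closes the handle.
    }

    /** Proposed shape: create a private instance and close it in finally in the same method. */
    static void bulkLoadWithOwnInstance(String user) {
        Fs fs = null;
        try {
            fs = new Fs();
            fs.moveFiles();
        } finally {
            if (fs != null) {
                fs.close();
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            bulkLoadWithCache("alice");
        }
        System.out.println("cached:   entries=" + CACHE.size() + " open=" + Fs.open);

        int openBefore = Fs.open;
        for (int i = 0; i < 1000; i++) {
            bulkLoadWithOwnInstance("alice");
        }
        System.out.println("per-call: leaked=" + (Fs.open - openBefore));
    }
}
```

In the cached variant the number of entries and open handles grows with the number of requests, even though the user name never changes. In the per-call variant nothing outlives the method, whether or not the client ever calls a cleanup method and regardless of which server handled the request. The cost is one instance creation per call, which is the trade-off the description already accepts.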



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
