[jira] [Commented] (HBASE-9737) Resource leak if a corrupt HFile presents itself in a region leading to Region Server OOM

Hadoop QA (JIRA) Wed, 09 Oct 2013 17:14:28 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-9737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791027#comment-13791027
 ]


Hadoop QA commented on HBASE-9737:
----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12607677/HBASE-9737.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
                        Please justify why no new tests are needed for this 
patch.
                        Also please list what manual steps were performed to 
verify this patch.

    {color:green}+1 hadoop1.0{color}.  The patch compiles against the hadoop 
1.0 profile.

    {color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

    {color:red}-1 site{color}.  The patch appears to cause mvn site goal to 
fail.

    {color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7507//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7507//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7507//console

This message is automatically generated.

> Resource leak if a corrupt HFile presents itself in a region leading to 
> Region Server OOM
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-9737
>                 URL: https://issues.apache.org/jira/browse/HBASE-9737
>             Project: HBase
>          Issue Type: Bug
>          Components: HFile
>    Affects Versions: 0.94.12
>            Reporter: Aditya Kishore
>            Assignee: Aditya Kishore
>            Priority: Critical
>         Attachments: HBASE-9737_0.94.patch, HBASE-9737_0.94.patch, 
> HBASE-9737_0.94.patch, HBASE-9737.patch, HBASE-9737.patch
>
>
> One of our customers was recently hit with OOM errors on almost all of the 
> region servers.
> A postmortem of the issue revealed that a corrupt HFile had made its way into 
> one of the regions, which resulted in the region being brought offline 
> immediately, as per design.
> What happened next reveals two issues:\\
> \\
> * As soon as the region was offlined, Master noticed this and tried to assign 
> the region to another region server, which of course failed (again due to the 
> corrupt HFile), and then Master tried to assign it to another, and so on. So 
> this region kept bouncing from one server to another, and this went unnoticed 
> for a few hours while all region server logs filled with thousands of this 
> message:{noformat}org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
>  Failed open of
> region=userdata,50743646010,1378139055806.318c533716869574f10615703269497f.,
> starting to roll back the global memstore size.
> java.io.IOException: java.io.IOException:
> org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile
> Trailer from file
> /hbase/userdata/318c533716869574f10615703269497f/data/a3e2ae39f71441ac92a6563479fb976e
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:550)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:463)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3835)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3783)
>         at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332)
>         at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
>         at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException:
> org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile
> Trailer from file
> /hbase/userdata/318c533716869574f10615703269497f/data/a3e2ae39f71441ac92a6563479fb976e
>         at 
> org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:404)
>         at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:257)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:3017)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:525)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:523)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> {noformat} For a situation like this, the region should be marked 
> "offlined_with_error" or something similar so that Master does not try to 
> assign it to another server until the user fixes the issue. I will create a 
> separate JIRA for that.
> * The second problem, and the scope of this JIRA, is that the function 
> {{org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion()}} throws an 
> exception without closing the {{FSDataInputStream}} objects even if 
> {{closeIStream}} is set to true. This led to orphaned filesystem streams 
> accumulating in the region server, which eventually died of OOM.
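
The leak described in the second item can be illustrated with a minimal, self-contained sketch. It is not the attached patch and does not use the HBase API: the class ReaderOpenSketch, the method pickVersion, the exception CorruptFileException, the two-byte "trailer" layout and the path string are all hypothetical stand-ins. It only shows the general pattern of closing the stream on the failure path when closeIStream is true, before the exception propagates to the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReaderOpenSketch {

    /** Hypothetical stand-in for a "corrupt file" exception. */
    static class CorruptFileException extends IOException {
        CorruptFileException(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    /** In-memory stream that records whether close() was called. */
    static class TrackedStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackedStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical magic byte marking a valid trailer in this sketch. */
    static final int TRAILER_MAGIC = 0x48;

    /** Reads a made-up two-byte trailer: magic byte, then version. */
    static int readTrailerVersion(InputStream in) throws IOException {
        int magic = in.read();
        int version = in.read();
        if (magic != TRAILER_MAGIC || version < 1) {
            throw new IOException("bad trailer");
        }
        return version;
    }

    /**
     * Returns the version from the trailer. If reading fails and
     * closeIStream is true, the stream is closed before the exception
     * is rethrown, so a failed open does not leave the stream orphaned.
     */
    static int pickVersion(InputStream in, String path, boolean closeIStream)
            throws IOException {
        try {
            return readTrailerVersion(in);
        } catch (Throwable t) {
            if (closeIStream) {
                try {
                    in.close();
                } catch (IOException closeFailure) {
                    // Keep the original failure as the primary cause.
                    t.addSuppressed(closeFailure);
                }
            }
            throw new CorruptFileException(
                "Problem reading trailer from file " + path, t);
        }
    }

    public static void main(String[] args) {
        TrackedStream corrupt = new TrackedStream(new byte[] {0, 0});
        try {
            pickVersion(corrupt, "/hypothetical/path", true);
        } catch (IOException e) {
            System.out.println("open failed, stream closed: " + corrupt.closed);
        }
    }
}
```

On the success path the stream is deliberately left open, since the caller goes on to read from it. Only the failure path closes it, and only when the caller has handed over ownership through closeIStream.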



--
This message was sent by Atlassian JIRA
(v6.1#6144)
