Aditya Kishore created HBASE-9737:
-------------------------------------

             Summary: Resource leak if a corrupt HFile presents itself in a 
region leading to Region Server OOM
                 Key: HBASE-9737
                 URL: https://issues.apache.org/jira/browse/HBASE-9737
             Project: HBase
          Issue Type: Bug
          Components: HFile
    Affects Versions: 0.94.12
            Reporter: Aditya Kishore
            Assignee: Aditya Kishore
            Priority: Critical


One of our customer was recently hit with OOM error on almost all of the region 
servers.

Postmortem of the issue reveled that a corrupt HFile had made its way into one 
of the regions which resulted into the region brought offline immediately which 
is as per design.

What happened next reveals two issues:\\
\\
* As soon as the region was offlined, Master noticed this and tried to assign 
the region to another region server which of course failed (again due to the 
corrupt HFile) and then Master tried to assign this to another and so on. So 
this region kept bouncing from one server to another and this went unnoticed 
for few hours and all region servers log were filled with thousands of this 
message:{noformat}org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
 Failed open of
region=userdata,50743646010,1378139055806.318c533716869574f10615703269497f.,
starting to roll back the global memstore size.
java.io.IOException: java.io.IOException:
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile
Trailer from file
/hbase/userdata/318c533716869574f10615703269497f/data/a3e2ae39f71441ac92a6563479fb976e
        at 
org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:550)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:463)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3835)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3783)
        at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332)
        at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
        at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException:
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile
Trailer from file
/hbase/userdata/318c533716869574f10615703269497f/data/a3e2ae39f71441ac92a6563479fb976e
        at 
org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:404)
        at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:257)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:3017)
        at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:525)
        at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:523)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
{noformat} For situation like this, the region should be marked 
"offlined_with_error" or something similar so that Master does not try to 
assign it to another server without user fixing the issue. I will create a 
separate JIRA for that.

* The second problem and the scope of this JIRA is that the function 
{{org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion()}} throws exception 
without closing the {{FSDataInputStream}} objects even if closeIStream is set 
to true. This lead to orphan filesystem streams accumulating in region server 
and it eventually died of OOM.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to