[ 
https://issues.apache.org/jira/browse/HBASE-9737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798851#comment-13798851
 ] 

Aditya Kishore commented on HBASE-9737:
---------------------------------------

I looked at the callers of the function a few levels down, and none of them 
seems to inspect the type of the exception thrown.

> Resource leak if a corrupt HFile presents itself in a region leading to 
> Region Server OOM
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-9737
>                 URL: https://issues.apache.org/jira/browse/HBASE-9737
>             Project: HBase
>          Issue Type: Bug
>          Components: HFile
>    Affects Versions: 0.94.12
>            Reporter: Aditya Kishore
>            Assignee: Aditya Kishore
>            Priority: Critical
>             Fix For: 0.98.0, 0.94.13, 0.96.1
>
>         Attachments: HBASE-9737_0.94.patch, HBASE-9737_0.94.patch, 
> HBASE-9737_0.94.patch, HBASE-9737.patch, HBASE-9737.patch
>
>
> One of our customers was recently hit with OOM errors on almost all of the 
> region servers.
> A postmortem of the issue revealed that a corrupt HFile had made its way into 
> one of the regions, which resulted in the region being brought offline 
> immediately, as per design.
> What happened next reveals two issues:
>
> * As soon as the region was offlined, the Master noticed this and tried to 
> assign the region to another region server, which of course failed (again due 
> to the corrupt HFile), and then the Master tried to assign it to yet another, 
> and so on. The region kept bouncing from one server to another, unnoticed for 
> a few hours, and all region server logs were filled with thousands of copies of this 
> message:{noformat}org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
>  Failed open of
> region=userdata,50743646010,1378139055806.318c533716869574f10615703269497f.,
> starting to roll back the global memstore size.
> java.io.IOException: java.io.IOException:
> org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile
> Trailer from file
> /hbase/userdata/318c533716869574f10615703269497f/data/a3e2ae39f71441ac92a6563479fb976e
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:550)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:463)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3835)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3783)
>         at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332)
>         at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
>         at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException:
> org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile
> Trailer from file
> /hbase/userdata/318c533716869574f10615703269497f/data/a3e2ae39f71441ac92a6563479fb976e
>         at 
> org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:404)
>         at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:257)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:3017)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:525)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:523)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> {noformat} For a situation like this, the region should be marked 
> "offlined_with_error" or something similar so that the Master does not try to 
> assign it to another server before the user has fixed the issue. I will create a 
> separate JIRA for that.
> * The second problem, and the scope of this JIRA, is that the function 
> {{org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion()}} throws an 
> exception without closing the {{FSDataInputStream}} objects even if 
> closeIStream is set to true. This led to orphaned filesystem streams 
> accumulating in the region server, which eventually died of OOM.
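
For illustration, the close-on-failure behaviour described in the second point 
can be sketched roughly as follows. This is not the attached patch and does not 
use the HBase classes: TrackingStream and readTrailerVersion are hypothetical 
stand-ins, the pickReaderVersion signature here is simplified, and the code 
assumes Java 7 or later for multi-catch and addSuppressed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CloseOnFailureSketch {

    /** Hypothetical stand-in for FSDataInputStream that records close() calls. */
    public static class TrackingStream extends ByteArrayInputStream {
        public boolean closed = false;

        public TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical trailer parser: accepts only a leading version byte of 1. */
    public static int readTrailerVersion(InputStream in) throws IOException {
        int version = in.read();
        if (version != 1) {
            throw new IOException("Problem reading trailer, version byte was " + version);
        }
        return version;
    }

    /**
     * Simplified sketch of the idea: if parsing fails and the caller handed
     * over ownership of the stream (closeIStream == true), close the stream
     * before rethrowing so that it is not left orphaned.
     */
    public static int pickReaderVersion(InputStream in, boolean closeIStream)
            throws IOException {
        try {
            // On success the stream stays open; the reader would own it from here.
            return readTrailerVersion(in);
        } catch (IOException | RuntimeException e) {
            if (closeIStream) {
                try {
                    in.close();
                } catch (IOException closeFailure) {
                    // Keep the original failure as the primary exception.
                    e.addSuppressed(closeFailure);
                }
            }
            throw e;
        }
    }
}
```

With closeIStream set to true, a corrupt input still raises the original 
exception to the caller, but the stream has been closed by then. With 
closeIStream set to false, the stream is left open and stays the caller's 
responsibility.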



--
This message was sent by Atlassian JIRA
(v6.1#6144)
