[ https://issues.apache.org/jira/browse/HBASE-9737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802515#comment-13802515 ]
Hudson commented on HBASE-9737:
-------------------------------

FAILURE: Integrated in HBase-0.94 #1180 (See [https://builds.apache.org/job/HBase-0.94/1180/])
HBASE-9737 Corrupt HFile cause resource leak leading to Region Server OOM (stack: rev 1534855)
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java

> Corrupt HFile cause resource leak leading to Region Server OOM
> --------------------------------------------------------------
>
>                 Key: HBASE-9737
>                 URL: https://issues.apache.org/jira/browse/HBASE-9737
>             Project: HBase
>          Issue Type: Bug
>          Components: HFile
>    Affects Versions: 0.94.12
>            Reporter: Aditya Kishore
>            Assignee: Aditya Kishore
>            Priority: Critical
>             Fix For: 0.98.0, 0.94.13, 0.96.1
>
>         Attachments: 9737.096.txt, HBASE-9737_0.94.patch, HBASE-9737_0.94.patch, HBASE-9737_0.94.patch, HBASE-9737.patch, HBASE-9737.patch
>
>
> One of our customers was recently hit with OOM errors on almost all of the
> region servers.
> A postmortem of the issue revealed that a corrupt HFile had made its way into
> one of the regions, which resulted in the region being brought offline
> immediately, which is as per design.
> What happened next reveals two issues:\\
> \\
> * As soon as the region was offlined, Master noticed this and tried to assign
> the region to another region server, which of course failed (again due to the
> corrupt HFile), and then Master tried to assign it to another, and so on. So
> this region kept bouncing from one server to another. This went unnoticed
> for a few hours, and the logs of all region servers were filled with thousands
> of copies of this message:{noformat}
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=userdata,50743646010,1378139055806.318c533716869574f10615703269497f., starting to roll back the global memstore size.
> java.io.IOException: java.io.IOException: org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file /hbase/userdata/318c533716869574f10615703269497f/data/a3e2ae39f71441ac92a6563479fb976e
>     at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:550)
>     at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:463)
>     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3835)
>     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3783)
>     at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332)
>     at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file /hbase/userdata/318c533716869574f10615703269497f/data/a3e2ae39f71441ac92a6563479fb976e
>     at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:404)
>     at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:257)
>     at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:3017)
>     at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:525)
>     at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:523)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> {noformat}
> For a situation like this, the region should be marked
> "offlined_with_error" or something similar so that Master does not try to
> assign it to another server without the user fixing the issue. I will create a
> separate JIRA for that.
> * The second problem, and the scope of this JIRA, is that the function
> {{org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion()}} throws an
> exception without closing the {{FSDataInputStream}} objects even if
> closeIStream is set to true. This led to orphaned filesystem streams
> accumulating in the region server, which eventually died of OOM.

--
This message was sent by Atlassian JIRA (v6.1#6144)
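
The second issue above describes a stream that stays open when the version check throws. A minimal Java sketch of one common remedy, a success flag checked in a `finally` block, is shown here. This is not the HBASE-9737 patch and does not use HBase or Hadoop classes: `ReaderOpenSketch`, `CorruptFileException`, `TrackedStream`, `Reader`, `openReader`, and the 4-byte version header with accepted values 1 and 2 are all hypothetical stand-ins chosen for illustration. Only the `closeIStream` flag name is taken from the issue text.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReaderOpenSketch {

    /** Hypothetical exception standing in for a "corrupt file" error. */
    public static class CorruptFileException extends IOException {
        public CorruptFileException(String message) {
            super(message);
        }
    }

    /** Hypothetical in-memory stream that records whether close() was called. */
    public static class TrackedStream extends ByteArrayInputStream {
        public boolean closed = false;

        public TrackedStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical reader that owns the stream once opening has succeeded. */
    public static class Reader implements Closeable {
        public final int version;
        private final InputStream in;

        Reader(InputStream in, int version) {
            this.in = in;
            this.version = version;
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }

    /**
     * Reads a 4-byte version (an assumed layout) and returns a Reader.
     * If anything fails before the Reader exists and closeIStream is true,
     * the stream is closed before the exception propagates.
     */
    public static Reader openReader(InputStream in, boolean closeIStream) throws IOException {
        boolean success = false;
        try {
            // Throws EOFException when fewer than 4 bytes are available.
            int version = new DataInputStream(in).readInt();
            if (version < 1 || version > 2) {
                throw new CorruptFileException("Unsupported version: " + version);
            }
            Reader reader = new Reader(in, version);
            success = true;
            return reader;
        } finally {
            if (!success && closeIStream) {
                try {
                    in.close();
                } catch (IOException ignored) {
                    // The original failure is the one worth reporting.
                }
            }
        }
    }

    public static void main(String[] args) {
        TrackedStream corrupt = new TrackedStream(new byte[] {0, 0});
        try {
            openReader(corrupt, true);
        } catch (IOException e) {
            System.out.println("open failed: " + e);
        }
        System.out.println("stream closed after failed open: " + corrupt.closed);
    }
}
```

The flag-plus-`finally` form is used because it also covers unchecked exceptions and compiles on Java 6, which the `Thread.java:662` frame in the trace suggests was the runtime in use. On success the stream is left open, since the returned reader now owns it.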