[
https://issues.apache.org/jira/browse/HBASE-21751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749651#comment-16749651
]
Allan Yang commented on HBASE-21751:
------------------------------------
{quote}
Why meta region can not online forever? I do not get the point, if the RS is
crashed, the region will be assigned to another RS?
{quote}
Yes, if the server crashes, it will recover, but the problem is that the
server won't crash just because opening a region fails.
The case is like this:
Two RSes, meta is on rs1.
1. The HDFS disk is full (or some other glitch occurs), rs1 fails to roll its
log and crashes, and the meta region begins to be re-assigned.
2. The meta region tries to open on rs2, but fails because of this issue.
3. rs1 is restarted and the meta region tries to open on rs1, which also fails
because of this issue.
4. The HDFS disk-full condition recovers, but a 0-size WAL left in the WAL dir
means neither RS can open the meta region, and they won't crash.
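The stuck state in step 4 can be reproduced in miniature on a local
filesystem. The following is only a sketch under stated assumptions: the class
WalLeftoverSketch, its methods, the "rs1" prefix and the prefix-based check are
all hypothetical stand-ins written for illustration, and java.nio.file is used
in place of HDFS. It is not the AbstractFSWAL code and not the attached patch.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not HBase code.
class WalLeftoverSketch {

    // Assumed stand-in for the existence check: refuse to create a WAL if a
    // file with the same prefix is already present in the WAL directory.
    static void checkNoExistingWal(Path walDir, String prefix) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(
                walDir, p -> p.getFileName().toString().startsWith(prefix))) {
            if (files.iterator().hasNext()) {
                throw new IOException("Target WAL already exists within directory " + walDir);
            }
        }
    }

    // Creates the WAL file. When diskFull is true, the file entry is created
    // but the write is simulated as failing, which leaves a 0-size file behind.
    static void createWal(Path walDir, String prefix, boolean diskFull) throws IOException {
        checkNoExistingWal(walDir, prefix);
        Path wal = Files.createFile(walDir.resolve(prefix + ".1"));
        if (diskFull) {
            throw new IOException("simulated: block allocation failed");
        }
        Files.write(wal, new byte[] {1});
    }

    // A region open fails if the WAL cannot be created; the caller (the
    // "server") keeps running and simply reports the failure.
    static boolean tryOpenRegion(Path walDir, String prefix, boolean diskFull) {
        try {
            createWal(walDir, prefix, diskFull);
            return true;
        } catch (IOException e) {
            System.out.println("open failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path walDir = Files.createTempDirectory("wals");
        // Disk full: the open fails and a 0-size file is left behind.
        System.out.println(tryOpenRegion(walDir, "rs1", true));
        // Disk recovered: the open still fails because of the leftover file.
        System.out.println(tryOpenRegion(walDir, "rs1", false));
        // Removing the leftover by hand (illustration only) unblocks the open.
        Files.delete(walDir.resolve("rs1.1"));
        System.out.println(tryOpenRegion(walDir, "rs1", false));
    }
}
```

In this sketch the second attempt fails even though the simulated disk has
recovered, which mirrors why retrying the assignment alone does not help while
the 0-size file is still in the WAL dir.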
> WAL creation fails during region open may cause region assign forever fail
> --------------------------------------------------------------------------
>
> Key: HBASE-21751
> URL: https://issues.apache.org/jira/browse/HBASE-21751
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.1.2, 2.0.4
> Reporter: Allan Yang
> Assignee: Allan Yang
> Priority: Major
> Fix For: 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21751.patch, HBASE-21751v2.patch
>
>
> When the first region opens on an RS, WALFactory will create a WAL file,
> but if the WAL creation fails, in some cases HDFS will leave an empty file in
> the dir (e.g. disk full: the file is created successfully but block
> allocation fails). We have a check in AbstractFSWAL that throws an error if a
> WAL belonging to the same factory already exists. Thus, the region can never
> be opened on this RS later.
> {code:java}
> 2019-01-17 02:15:53,320 ERROR [RS_OPEN_META-regionserver/server003:16020-0] handler.OpenRegionHandler(301): Failed open of region=hbase:meta,,1.1588230740
> java.io.IOException: Target WAL already exists within directory hdfs://cluster/hbase/WALs/server003.hbase.hostname.com,16020,1545269815888
> at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.<init>(AbstractFSWAL.java:382)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.<init>(AsyncFSWAL.java:210)
> at org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:72)
> at org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:47)
> at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:138)
> at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:57)
> at org.apache.hadoop.hbase.wal.WALFactory.getWAL(WALFactory.java:264)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.getWAL(HRegionServer.java:2085)
> at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:284)
> at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {code}