[
https://issues.apache.org/jira/browse/HBASE-12791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajeshbabu Chintaguntla updated HBASE-12791:
--------------------------------------------
Attachment: HBASE-12791_v3.patch
[~enis]
Thanks for the review. Here is the updated patch.
bq. Can we move the cleaning logic away from RegionStates though (ideally to
SSH, or to a utility method).
We cannot do the cleanup in SSH because RegionStates#serverOffline removes the
transitions of regions that need not be opened, so the regions in SPLITTING_NEW
never reach SSH. I have therefore added a utility method in FSUtils to do the
cleanup.
bq. Can you add logging here:
Added the log here.
bq. The hbck change seems costly. We already have all the regions from hdfs and
meta at that point no?
The current patch makes use of the region info already loaded from meta and
HDFS.
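To illustrate the idea only (this is not the code in the patch), the check can be thought of as a set difference over region names that are already in memory: anything present on HDFS but neither listed in meta nor deployed is a candidate leftover. The class and method names below are hypothetical, and the region names are the ones from the logs in the description.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of HBase: models the hbck-style check
// using plain sets of encoded region names that were already loaded.
final class OrphanRegionFinder {
    private OrphanRegionFinder() {}

    // Returns the regions that exist on HDFS but are neither listed in
    // meta nor deployed on any region server. Inputs are not modified.
    static Set<String> findOrphans(Set<String> onHdfs,
                                   Set<String> inMeta,
                                   Set<String> deployed) {
        Set<String> orphans = new TreeSet<>(onHdfs);
        orphans.removeAll(inMeta);
        orphans.removeAll(deployed);
        return orphans;
    }

    public static void main(String[] args) {
        String parent = "80c665138d4fa32da4d792d8ed13206f";
        String daughterA = "dd9731ee43b104da565257ca1539aa8c";
        String daughterB = "2e40a44511c0e187d357d651f13a1dab";

        // Assumed state after the aborted split: the parent is still in
        // meta and deployed, the daughters exist only as HDFS directories.
        Set<String> onHdfs = new TreeSet<>(Arrays.asList(parent, daughterA, daughterB));
        Set<String> inMeta = new TreeSet<>(Arrays.asList(parent));
        Set<String> deployed = new TreeSet<>(Arrays.asList(parent));

        // Prints the two daughter region names.
        System.out.println(findOrphans(onHdfs, inMeta, deployed));
    }
}
```

Because the inputs are already loaded, no extra scan of HDFS or meta is needed for this step.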
Apart from this, I tried to handle the cleanup during master startup, but was
not able to identify the regions in SPLITTING_NEW state because we do not
persist the daughter regions until the split commit happens (which is the
correct behavior).
In branch-1 and 0.98, regions in SPLITTING_NEW state cannot be identified
during master startup because we delegate dead server handling to SSH, which
mostly depends on meta and in-memory state (it does not read from ZK).
Please review.
> HBase does not attempt to clean up an aborted split when the regionserver
> shutting down
> ---------------------------------------------------------------------------------------
>
> Key: HBASE-12791
> URL: https://issues.apache.org/jira/browse/HBASE-12791
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 0.98.0
> Reporter: Rajeshbabu Chintaguntla
> Assignee: Rajeshbabu Chintaguntla
> Priority: Critical
> Fix For: 2.0.0, 0.98.10, 1.0.1
>
> Attachments: HBASE-12791.patch, HBASE-12791_v2.patch,
> HBASE-12791_v3.patch
>
>
> HBase does not clean up the daughter region directories from HDFS if the
> region server shuts down after creating them during the split.
> Here are the logs.
> -> RS shut down after creating the daughter regions.
> {code}
> 2014-12-31 09:05:41,406 DEBUG [regionserver60020-splits-1419996941385]
> zookeeper.ZKAssign: regionserver:60020-0x14a9701e53100d1,
> quorum=localhost:2181, baseZNode=/hbase Transitioned node
> 80c665138d4fa32da4d792d8ed13206f from RS_ZK_REQUEST_REGION_SPLIT to
> RS_ZK_REQUEST_REGION_SPLIT
> 2014-12-31 09:05:41,514 DEBUG [regionserver60020-splits-1419996941385]
> regionserver.HRegion: Closing
> t,,1419996880699.80c665138d4fa32da4d792d8ed13206f.: disabling compactions &
> flushes
> 2014-12-31 09:05:41,514 DEBUG [regionserver60020-splits-1419996941385]
> regionserver.HRegion: Updates disabled for region
> t,,1419996880699.80c665138d4fa32da4d792d8ed13206f.
> 2014-12-31 09:05:41,516 INFO
> [StoreCloserThread-t,,1419996880699.80c665138d4fa32da4d792d8ed13206f.-1]
> regionserver.HStore: Closed f
> 2014-12-31 09:05:41,518 INFO [regionserver60020-splits-1419996941385]
> regionserver.HRegion: Closed
> t,,1419996880699.80c665138d4fa32da4d792d8ed13206f.
> 2014-12-31 09:05:49,922 DEBUG [regionserver60020-splits-1419996941385]
> regionserver.MetricsRegionSourceImpl: Creating new MetricsRegionSourceImpl
> for table t dd9731ee43b104da565257ca1539aa8c
> 2014-12-31 09:05:49,922 DEBUG [regionserver60020-splits-1419996941385]
> regionserver.HRegion: Instantiated
> t,,1419996941401.dd9731ee43b104da565257ca1539aa8c.
> 2014-12-31 09:05:49,929 DEBUG [regionserver60020-splits-1419996941385]
> regionserver.MetricsRegionSourceImpl: Creating new MetricsRegionSourceImpl
> for table t 2e40a44511c0e187d357d651f13a1dab
> 2014-12-31 09:05:49,929 DEBUG [regionserver60020-splits-1419996941385]
> regionserver.HRegion: Instantiated
> t,row2,1419996941401.2e40a44511c0e187d357d651f13a1dab.
> Wed Dec 31 09:06:30 IST 2014 Terminating regionserver
> 2014-12-31 09:06:30,465 INFO [Thread-8] regionserver.ShutdownHook: Shutdown
> hook starting; hbase.shutdown.hook=true;
> fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@42d2282e
> {code}
> -> Rollback is skipped if the RS is stopped or stopping, so we end up with
> dirty daughter regions in HDFS.
> {code}
> 2014-12-31 09:07:49,547 INFO [regionserver60020-splits-1419996941385]
> regionserver.SplitRequest: Skip rollback/cleanup of failed split of
> t,,1419996880699.80c665138d4fa32da4d792d8ed13206f. because server is stopped
> java.io.InterruptedIOException: Interrupted after 0 tries on 350
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:156)
> {code}
> Because of this, hbck always reports inconsistencies.
> {code}
> ERROR: Region { meta => null, hdfs =>
> hdfs://localhost:9000/hbase/data/default/t/2e40a44511c0e187d357d651f13a1dab,
> deployed => } on HDFS, but not listed in hbase:meta or deployed on any
> region server
> ERROR: Region { meta => null, hdfs =>
> hdfs://localhost:9000/hbase/data/default/t/dd9731ee43b104da565257ca1539aa8c,
> deployed => } on HDFS, but not listed in hbase:meta or deployed on any
> region server
> {code}
> If we try to repair, we end up with overlapping regions in hbase:meta, and
> both the daughter regions and the parent are online.