[
https://issues.apache.org/jira/browse/HBASE-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054009#comment-13054009
]
Aaron Kimball commented on HBASE-3872:
--------------------------------------
Stack,
I'm still working on getting more log data available. Hopefully I can get you
more info. Here's something interesting: I believe the parent regions are still
hanging around in HBase regionservers.
hbck reports:
{code}
ERROR: Region UNKNOWN_REGION on <regionserver-redacted>:60020, key=4f424608930f7b3ae7c05c49e2bac2c1, not on HDFS or in META but deployed on <regionserver-redacted>:60020
ERROR: Region UNKNOWN_REGION on <regionserver-redacted>:60020, key=81c5fb35e10f8ef61da78bbba28db7f9, not on HDFS or in META but deployed on <regionserver-redacted>:60020
{code}
The region keys match those of the split parents which were abandoned when the
"successful" rollback didn't restore the parent entries in {{.META.}}. Is there
a way to force these back to storefiles on disk, and then manually add them to
{{.META.}}?
> Hole in split transaction rollback; edits to .META. need to be rolled back
> even if it seems like they didn't make it
> --------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-3872
> URL: https://issues.apache.org/jira/browse/HBASE-3872
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 0.90.3
> Reporter: stack
> Assignee: stack
> Priority: Blocker
> Fix For: 0.90.4
>
> Attachments: 3872.txt
>
>
> Saw this interesting one on a cluster of ours. The cluster was configured
> with too few handlers, so we saw a lot of the phenomenon where actions were
> queued but, by the time they got into the server and it tried to respond to
> the client, the client had disconnected because of the 60-second timeout.
> Well, the meta edits for a split were queued at the regionserver carrying
> .META. and by the time it went to write back, the client had gone (this was
> the first insert: parent offline with daughter regions added as info:splitA
> and info:splitB). The client presumed the edits failed and 'successfully'
> rolled back the transaction (failing to undo the .META. edits, thinking they
> didn't go through).
> A few minutes later the .META. scanner on master runs. It sees 'no
> references' in daughters -- the daughters had been cleaned up as part of the
> split transaction rollback -- so it thinks it's safe to delete the parent.
> Two things:
> + Tighten up the check in master... it needs to check that the daughter
> region at least exists and possibly that the daughter region has an entry in
> .META.
> + Depending on which edit fails, schedule rollback edits even though it will
> seem like they didn't go through.
> This is a pretty critical one.
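The two points in the description can be illustrated with a small toy model. This is only a sketch under stated assumptions: the Daughter class and both function names are hypothetical, it uses none of the real HBase APIs, and it is not the attached patch. It shows why "no references" alone is not enough to delete a split parent, and why an unacknowledged .META. edit should still be rolled back.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Daughter:
    """Hypothetical stand-in for what the master can observe about a daughter."""
    in_meta: bool         # daughter has its own row in .META.
    on_hdfs: bool         # daughter region directory exists on HDFS
    has_references: bool  # daughter still holds reference files to the parent


def safe_to_delete_parent(split_a: Optional[Daughter],
                          split_b: Optional[Daughter]) -> bool:
    """Return True only if both daughters exist and no longer reference the parent."""
    for daughter in (split_a, split_b):
        # A missing daughter has no references either, so absence must be
        # ruled out before "no references" means anything.
        if daughter is None or not daughter.on_hdfs or not daughter.in_meta:
            return False
        if daughter.has_references:
            return False
    return True


def meta_edit_needs_rollback(attempted: bool, acknowledged: bool) -> bool:
    """Undo any .META. edit that was sent, whether or not it was acknowledged."""
    # A client-side timeout does not prove the edit was dropped on the server,
    # so the acknowledgement is deliberately ignored here.
    return attempted
```

In the scenario from the description, the rolled-back daughters correspond to Daughter(in_meta=False, on_hdfs=False, has_references=False), for which safe_to_delete_parent returns False, and the timed-out first insert corresponds to meta_edit_needs_rollback(attempted=True, acknowledged=False), which returns True.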