[ 
https://issues.apache.org/jira/browse/HBASE-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13651948#comment-13651948
 ] 

Dimitri Goldin commented on HBASE-8502:
---------------------------------------

I didn't notice the region until I ran hbck, and according to the hbck run 
itself it simply wasn't assigned anywhere anymore (I don't know why yet). 

It seems the actual problem happened sometime in March, while I was out of the 
office, but at least one of the daughters seems to have been stuck for a 
while. I think somebody restarted the cluster and ran hbck -repair back then 
in an attempt to fix the issue. Somehow the problem then went away for a while, 
in the sense that the regions simply were not onlined.

I didn't discover it until a couple of days ago, since the table in question is 
only updated every couple of weeks.

Fortunately I was able to find some old logs in a backup from mid-March 
mentioning failed splits and rollbacks of the parent region including a 
different daughter region (939c1e9d10cc4e97d7284025f20298fb), which seemed to 
have the same problem. I presume both were created sometime on 2013-03-18 
during the same failed split. But unfortunately they are not explicitly 
mentioned as daughters in the logs, since the split failed. Sadly I do not have 
any logs left between ~2013-03-20 and 2013-05-01, since most were rotated by 
now.

Unfortunately I was also unable to find any mention of the 
79c619508659018ff3ef0887611eb8f7 region in the master-logs from that time.

As to what happened to the parent region after the split, I'm currently not 
sure. It was evidently removed at some point, which made it impossible to 
online either daughter, even though parts of the logs state that the splits 
were rolled back. The last mention of the region I'm able to find is from 
2013-03-18 19:45:00,014 (MASTER, point of no return error).

There is also another 'new' question: why did the attempts to online both 
daughters stop at some point, until hbck tried to touch one of them? It's also 
unclear what happened to the second daughter (939...8fb).

Please see the attached file, in which I tried to collect the relevant sections 
of the logs from hbck, master and regionserver. I hope this helps. I will try 
to find more and update.
                
> Eternally stuck Region after split
> ----------------------------------
>
>                 Key: HBASE-8502
>                 URL: https://issues.apache.org/jira/browse/HBASE-8502
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1
>            Reporter: Dimitri Goldin
>            Priority: Critical
>         Attachments: stuck_region_exception.txt
>
>
> Exact HBase version: 0.92.1-cdh4.1.2
> A couple of days ago I encountered a RIT problem with a single region.
> After an hbck run it started trying to assign a region which has been 
> bouncing between OFFLINE/PENDING_OPEN/OPENING for two days afterwards.
> This was due to a split gone wrong in some way, which led to several 
> reference files being left in the region-directory despite the two relevant 
> HFiles being copied successfully to the daughter.
> I will try to give as many details as possible, but unfortunately I was
> unable to find any information about the split itself.
> Short thread about this issue on the users-ML: 
> http://mail-archives.apache.org/mod_mbox/hbase-user/201305.mbox/%[email protected]%3E
> ===
> Parent region: 5b9c16898a371de58f31f0bdf86b1f8b
> Daughter region in question: 79c619508659018ff3ef0887611eb8f7
> Rough sequence from the logs seems to be the following:
> ===
> * Received request to open region:
> documents,7128586022887322720,1363696791400.79c619508659018ff3ef0887611eb8f7.
> * Setting up tabledescriptor config now ...
> * Opening of region {NAME =>
> 'documents,7128586022887322720,1363696791400.79c619508659018ff3ef0887611eb8f7.',
>      STARTKEY => '7128586022887322720',
>      ENDKEY => '7130716361635801616',
>      ENCODED => 79c619508659018ff3ef0887611eb8f7,} failed, marking as 
> FAILED_OPEN in ZK
> * File does not exist: 
> /hbase/documents/5b9c16898a371de58f31f0bdf86b1f8b/d/0707b1ec4c6b41cf9174e0d2a1785fe9
>  
> [...]
> ===
> What happened was that somehow (and that's the question here) the daughter's
> region folder contained some left-over reference files, which were causing 
> the RegionServer to look up the parent region, which had already been deleted.
> Original contents of /hbase/documents/79c619508659018ff3ef0887611eb8f7/d:
> ==
> 0707b1ec4c6b41cf9174e0d2a1785fe9.5b9c16898a371de58f31f0bdf86b1f8b
> 47511faae81b4452afd3ca206e28346f.5b9c16898a371de58f31f0bdf86b1f8b
> 4f01ecd052ce464d81e79a62ea227d6b
> 4f01ecd052ce464d81e79a62ea227d6b.5b9c16898a371de58f31f0bdf86b1f8b
> eb7dbb09701d4353be24ca82481c4a7e
> == 
> I attached the full FileNotFound Exception.
> Please let me know if I can provide more information or help otherwise.
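
To make the directory listing above easier to read, here is a minimal sketch 
that separates plain HFiles from reference files purely by name. It assumes, 
based only on the listing itself, that a reference file is named 
<hfile name>.<encoded name of the parent region>. The function names and the 
existing_regions argument are hypothetical helpers for illustration and are 
not part of HBase or hbck; the set of regions that still exist on HDFS would 
have to be supplied by whoever runs it.

```python
import re

# Assumption: reference files look like "<hex hfile name>.<hex parent region>".
REF_NAME = re.compile(r"^([0-9a-f]+)\.([0-9a-f]+)$")


def split_store_listing(names):
    """Separate plain HFiles from reference files using the file name only."""
    hfiles, references = [], []
    for name in names:
        match = REF_NAME.match(name)
        if match:
            # (referenced hfile, encoded name of the parent region)
            references.append((match.group(1), match.group(2)))
        else:
            hfiles.append(name)
    return hfiles, references


def dangling_references(names, existing_regions):
    """Return references whose parent region is not in existing_regions."""
    _, references = split_store_listing(names)
    return [(hfile, parent) for hfile, parent in references
            if parent not in existing_regions]


# File names copied from the listing of the daughter's column family "d".
listing = [
    "0707b1ec4c6b41cf9174e0d2a1785fe9.5b9c16898a371de58f31f0bdf86b1f8b",
    "47511faae81b4452afd3ca206e28346f.5b9c16898a371de58f31f0bdf86b1f8b",
    "4f01ecd052ce464d81e79a62ea227d6b",
    "4f01ecd052ce464d81e79a62ea227d6b.5b9c16898a371de58f31f0bdf86b1f8b",
    "eb7dbb09701d4353be24ca82481c4a7e",
]

# The parent region directory is gone, so it is absent from this set.
existing = {"79c619508659018ff3ef0887611eb8f7"}

for hfile, parent in dangling_references(listing, existing):
    print("dangling reference to", hfile, "in missing parent", parent)
```

With the parent 5b9c16898a371de58f31f0bdf86b1f8b absent from the set, all three 
reference files are reported as dangling, which matches the "File does not 
exist" error the RegionServer raised when it tried to follow them.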

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
