[jira] [Comment Edited] (HBASE-19457) Debugging flaky TestTruncateTableProcedure

Appy (JIRA) Thu, 14 Dec 2017 01:38:21 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284510#comment-16284510
 ]


Appy edited comment on HBASE-19457 at 12/14/17 9:37 AM:
--------------------------------------------------------

Although the test timed out after 60 sec, it had at least reached the 
TRUNCATE_TABLE_ADD_TO_META state.
One bug I see here: if the master crashes after performing 
TRUNCATE_TABLE_ADD_TO_META, then when the new AM comes up, it reads back meta 
and assumes that these newly added regions are offline and need to be 
reassigned (see logs below). But that's wrong, since the recovered 
TruncateTableProcedure will try to do the same.
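
To make the double assignment concrete, here is a toy model of the sequence described above. It is only a sketch: none of these classes or methods are HBase code, the region names are made up, and it assumes the recovered procedure resumes right after the add-to-meta step and assigns every region it wrote.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names, not HBase classes.
final class TruncateRecoverySketch {

    // Regions that the truncate procedure had already written to (simulated) meta
    // before the master crashed. No assignment state was recorded for them.
    static List<String> regionsInMeta() {
        List<String> regions = new ArrayList<>();
        regions.add("region-a");
        regions.add("region-b");
        return regions;
    }

    // New AM on startup: reads meta, treats every region without a recorded
    // state as offline, and requests an assign for it.
    static List<String> amStartupAssigns(List<String> meta) {
        List<String> requests = new ArrayList<>();
        for (String region : meta) {
            requests.add(region);
        }
        return requests;
    }

    // Recovered truncate procedure: continues from where it left off and
    // requests an assign for the same regions (assumption stated above).
    static List<String> recoveredTruncateAssigns(List<String> meta) {
        return new ArrayList<>(meta);
    }

    // Counts how many assign requests each region receives across both actors.
    static Map<String, Integer> assignRequestsPerRegion() {
        List<String> meta = regionsInMeta();
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String region : amStartupAssigns(meta)) {
            counts.merge(region, 1, Integer::sum);
        }
        for (String region : recoveredTruncateAssigns(meta)) {
            counts.merge(region, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        assignRequestsPerRegion().forEach(
            (region, n) -> System.out.println(region + " assign requests: " + n));
    }
}
```

In this model every region ends up with two independent assign requests, one from each actor, which is the conflict described above.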

*Edit*: Striking out the text below. The AM shouldn't be assigning regions 
left behind by incomplete actions, so the analysis of what happens when it 
does (the text below) is irrelevant. The fix here should be the truncate 
procedure managing the state of the table/regions in meta better, AND/OR the 
assignment manager handling default cases in a better way.
--Since there's a table lock, they won't execute in parallel, but I think 
it'll be:--
- --Fine if the truncate proc executes first. The assign procs will complete 
immediately, saying the regions are assigned.--
- --Bad if the assign procs execute first, since when the truncate proc 
executes later, it'll meddle with hbase:meta assuming the table is disabled, 
try to assign, and fail.--

--An even worse scenario would be: if 1) the master crashes, and 2) the region 
containing meta crashes in between puts and only some puts succeed,--
--then the new AM will bring only some of the table's regions online (since 
even meta doesn't know about all the regions).--

Waiting for more runs after increasing timeout to get better logs.




was (Author: appy):
Although the test timed out after 60 sec, it had at least reached the 
TRUNCATE_TABLE_ADD_TO_META state.
One bug I see here: if the master crashes after performing 
TRUNCATE_TABLE_ADD_TO_META, then when the new AM comes up, it reads back meta 
and assumes that these newly added regions are offline and need to be 
reassigned (see logs below). But that's wrong, since the recovered 
TruncateTableProcedure will try to do the same.
Since there's a table lock, they won't execute in parallel, but I think it'll 
be:
- Fine if the truncate proc executes first. The assign procs will complete 
immediately, saying the regions are assigned.
- Bad if the assign procs execute first, since when the truncate proc executes 
later, it'll meddle with hbase:meta assuming the table is disabled, try to 
assign, and fail.

An even worse scenario would be: if 1) the master crashes, and 2) the region 
containing meta crashes in between puts and only some puts succeed,
then the new AM will bring only some of the table's regions online (since even 
meta doesn't know about all the regions).

Waiting for more runs after increasing timeout to get better logs.



> Debugging flaky TestTruncateTableProcedure
> ------------------------------------------
>
>                 Key: HBASE-19457
>                 URL: https://issues.apache.org/jira/browse/HBASE-19457
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Appy
>            Assignee: Appy
>         Attachments: patch1, test-output.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
