[ 
https://issues.apache.org/jira/browse/HBASE-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-18143:
--------------------------
    Attachment: HBASE-18143.master.002.patch

Unrelated. Retrying in the meantime.

> [AMv2] Backoff on failed report of region transition quickly goes to 
> astronomical time scale
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-18143
>                 URL: https://issues.apache.org/jira/browse/HBASE-18143
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.0.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: HBASE-18143.master.001.patch, 
> HBASE-18143.master.002.patch, HBASE-18143.master.002.patch
>
>
> Testing on a cluster w/ aggressive killing, if the Master is killed serially a 
> few times such that it is offline for a good while, regionservers that want to 
> report a region transition pause too long between retries.
> Here is the regionserver reporting failures:
> {code}
> 2017-05-31 20:50:53,840 INFO  [RS_CLOSE_REGION-ve0542:16020-2] 
> regionserver.HRegionServer: Failed report of region transition server { 
> host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 
> } transition { transition_code: CLOSED region_info { region_id: 1496284931226 
> table_name { namespace: "default" qualifier: 
> "IntegrationTestBigLinkedList" } start_key: 
> "\337\377\377\377\377\377\377\362" end_key: 
> "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 
> } }; retry (#0) after 1008ms delay (Master is coming online...).
> 2017-05-31 20:50:54,853 INFO  [RS_CLOSE_REGION-ve0542:16020-2] 
> regionserver.HRegionServer: Failed report of region transition server { 
> host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 
> } transition { transition_code: CLOSED region_info { region_id: 1496284931226 
> table_name { namespace: "default" qualifier: 
> "IntegrationTestBigLinkedList" } start_key: 
> "\337\377\377\377\377\377\377\362" end_key: 
> "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 
> } }; retry (#1) after 2026ms delay (Master is coming online...).
> 2017-05-31 20:50:56,886 INFO  [RS_CLOSE_REGION-ve0542:16020-2] 
> regionserver.HRegionServer: Failed report of region transition server { 
> host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 
> } transition { transition_code: CLOSED region_info { region_id: 1496284931226 
> table_name { namespace: "default" qualifier: 
> "IntegrationTestBigLinkedList" } start_key: 
> "\337\377\377\377\377\377\377\362" end_key: 
> "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 
> } }; retry (#2) after 6084ms delay (Master is coming online...).
> 2017-05-31 20:51:02,976 INFO  [RS_CLOSE_REGION-ve0542:16020-2] 
> regionserver.HRegionServer: Failed report of region transition server { 
> host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 
> } transition { transition_code: CLOSED region_info { region_id: 1496284931226 
> table_name { namespace: "default" qualifier: 
> "IntegrationTestBigLinkedList" } start_key: 
> "\337\377\377\377\377\377\377\362" end_key: 
> "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 
> } }; retry (#3) after 30588ms delay (Master is coming online...).
> 2017-05-31 20:51:33,570 INFO  [RS_CLOSE_REGION-ve0542:16020-2] 
> regionserver.HRegionServer: Failed report of region transition server { 
> host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 
> } transition { transition_code: CLOSED region_info { region_id: 1496284931226 
> table_name { namespace: "default" qualifier: 
> "IntegrationTestBigLinkedList" } start_key: 
> "\337\377\377\377\377\377\377\362" end_key: 
> "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 
> } }; retry (#4) after 308422ms delay (Master is coming online...).
> 2017-05-31 20:56:41,997 INFO  [RS_CLOSE_REGION-ve0542:16020-2] 
> regionserver.HRegionServer: Failed report of region transition server { 
> host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 
> } transition { transition_code: CLOSED region_info { region_id: 1496284931226 
> table_name { namespace: "default" qualifier: 
> "IntegrationTestBigLinkedList" } start_key: 
> "\337\377\377\377\377\377\377\362" end_key: 
> "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 
> } }; retry (#5) after 6171203ms delay (Master is coming online...).
> {code}
> See how by retry #5 we are waiting about 103 minutes (6171203ms) before we 
> retry again. That is too long. Retries should happen more frequently: data 
> is offline until the close is successfully reported.
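
For illustration only: the logged delays (1008, 2026, 6084, 30588, 308422, 6171203 ms) are roughly what you get if each delay is the previous delay multiplied by a growing factor (1, 2, 3, 5, 10, 20), i.e. the backoff compounds. The sketch below contrasts that with applying the factor to a fixed base and capping the result. The class name, the multiplier table, the 1000 ms base and the 10 s cap are all assumptions chosen to match the log; this is not the code in the attached patch, and the real log values are slightly higher than the sketch's, possibly due to jitter.

```java
public class BackoffSketch {
    // Hypothetical multiplier table, picked because it reproduces the logged delays.
    static final int[] MULTIPLIERS = {1, 2, 3, 5, 10, 20};

    // Multiplier for a given retry count; clamps to the ends of the table.
    static int multiplier(int retry) {
        int index = Math.max(0, Math.min(retry, MULTIPLIERS.length - 1));
        return MULTIPLIERS[index];
    }

    // Compounding backoff: every retry multiplies the PREVIOUS delay,
    // so the delay grows as the product of all multipliers so far.
    static long compoundedDelay(long baseMs, int retry) {
        long delay = baseMs;
        for (int i = 0; i <= retry; i++) {
            delay *= multiplier(i);
        }
        return delay;
    }

    // Bounded backoff: the multiplier is applied to the fixed base only,
    // and the result never exceeds maxMs.
    static long boundedDelay(long baseMs, int retry, long maxMs) {
        return Math.min(baseMs * multiplier(retry), maxMs);
    }

    public static void main(String[] args) {
        for (int retry = 0; retry <= 5; retry++) {
            System.out.printf("retry #%d: compounded=%dms bounded=%dms%n",
                retry, compoundedDelay(1000, retry), boundedDelay(1000, retry, 10_000));
        }
    }
}
```

With a 1000 ms base, the compounded delay at retry #5 is 6,000,000 ms (100 minutes), in line with the log above, while the bounded variant never waits longer than the cap.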



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
