[ https://issues.apache.org/jira/browse/HBASE-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-18143:
--------------------------
    Attachment: HBASE-18143.master.002.patch

Unrelated. Retrying in the meantime.

> [AMv2] Backoff on failed report of region transition quickly goes to astronomical time scale
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-18143
>                 URL: https://issues.apache.org/jira/browse/HBASE-18143
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>   Affects Versions: 2.0.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: HBASE-18143.master.001.patch, HBASE-18143.master.002.patch, HBASE-18143.master.002.patch
>
>
> Testing on cluster w/ aggressive killing, if Master is killed serially a few times such that it is offline a good while, regionservers that want to report a region transition pause too long between retries.
> Here is the regionserver reporting failures:
> {code}
> 2017-05-31 20:50:53,840 INFO [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#0) after 1008ms delay (Master is coming online...).
> 2017-05-31 20:50:54,853 INFO [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#1) after 2026ms delay (Master is coming online...).
> 2017-05-31 20:50:56,886 INFO [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#2) after 6084ms delay (Master is coming online...).
> 2017-05-31 20:51:02,976 INFO [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#3) after 30588ms delay (Master is coming online...).
> 2017-05-31 20:51:33,570 INFO [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#4) after 308422ms delay (Master is coming online...).
> 2017-05-31 20:56:41,997 INFO [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#5) after 6171203ms delay (Master is coming online...).
> {code}
> See how by the time we get to retry #5, we are waiting over 100 minutes (6171203ms) before we'll retry. That is too long. Make retry happen more frequently. Data is offline until the close is successfully reported.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
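
The ratios between successive logged delays are roughly 2, 3, 5, 10 and 20 (2026/1008, 6084/2026, 30588/6084, 308422/30588, 6171203/308422), which is consistent with each delay being the previous delay multiplied by a growing factor. The sketch below illustrates that compounding next to a non-compounding, capped alternative. It is only an illustration: the class `BackoffSketch`, its method names, the multiplier table and the 10-second cap are assumptions made here, inferred from the log above, and are not taken from the attached patch or from the HBase source. Jitter is left out to keep the numbers easy to follow.

```java
public class BackoffSketch {
    // Hypothetical multipliers, inferred from the ratios between the logged delays.
    static final int[] MULTIPLIERS = {1, 2, 3, 5, 10, 20};

    // Look up the multiplier for a retry count, reusing the last entry once the table runs out.
    static int multiplier(int tries) {
        return MULTIPLIERS[Math.min(tries, MULTIPLIERS.length - 1)];
    }

    /** Compounding: the previous delay is fed back in, so the growth is roughly factorial. */
    public static long compounding(long previousMillis, int tries) {
        return previousMillis * multiplier(tries);
    }

    /** Non-compounding: always scale the fixed base pause, then clamp to an upper bound. */
    public static long capped(long baseMillis, int tries, long maxMillis) {
        return Math.min(baseMillis * multiplier(tries), maxMillis);
    }

    public static void main(String[] args) {
        long base = 1000L;      // assumed base pause in milliseconds
        long maxPause = 10_000L; // assumed cap, chosen only for illustration
        long previous = base;
        for (int tries = 0; tries < MULTIPLIERS.length; tries++) {
            previous = compounding(previous, tries);
            System.out.printf("retry #%d compounding=%dms capped=%dms%n",
                tries, previous, capped(base, tries, maxPause));
        }
    }
}
```

With a 1000ms base, the compounding column reaches 6,000,000ms at retry #5, close to the 6171203ms in the log, while the capped column never exceeds the chosen bound.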