[
https://issues.apache.org/jira/browse/ACCUMULO-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Busbey updated ACCUMULO-2694:
----------------------------------
Status: Patch Available (was: In Progress)
> Offline tables block balancing for online tables
> ------------------------------------------------
>
> Key: ACCUMULO-2694
> URL: https://issues.apache.org/jira/browse/ACCUMULO-2694
> Project: Accumulo
> Issue Type: Bug
> Components: master
> Affects Versions: 1.5.0, 1.4.0, 1.6.0
> Environment: 1.6.0-RC2 Started CI with a 10-tablet pre-split table.
> Reporter: Mike Drob
> Assignee: Sean Busbey
> Priority: Critical
> Labels: 16_qa_bug
> Fix For: 1.4.6, 1.5.2, 1.6.1, 1.7.0
>
>
> Both DefaultLoadBalancer and ChaoticLoadBalancer won't balance if there are
> outstanding migrations.
> In this instance, we have offline tables from previous CI runs. One of these
> tables had outstanding migrations:
> {noformat}
> 2014-04-17 09:47:34,716 [balancer.TabletBalancer] DEBUG: Scanning tablet
> server a2438.halxg.cloudera.com:10011[544d5edf1fec529] for table 8
> 2014-04-17 09:47:36,217 [balancer.TabletBalancer] DEBUG: Scanning tablet
> server a2416.halxg.cloudera.com:10011[244d5edf0b0c4ff] for table 8
> 2014-04-17 09:47:36,222 [balancer.DefaultLoadBalancer] DEBUG: balance ended
> with 4 migrations
> 2014-04-17 09:47:36,222 [balancer.DefaultLoadBalancer] DEBUG: balance ended
> with 0 migrations
> 2014-04-17 09:47:36,222 [master.Master] DEBUG: migration 8;4aa09c;4a809d:
> a2438.halxg.cloudera.com:10011[544d5edf1fec529] ->
> a2422.halxg.cloudera.com:10011[3451dd2d9fa6761]
> 2014-04-17 09:47:36,222 [master.Master] DEBUG: migration 8;7e603;7e4029:
> a2438.halxg.cloudera.com:10011[544d5edf1fec529] ->
> a2422.halxg.cloudera.com:10011[3451dd2d9fa6761]
> 2014-04-17 09:47:36,222 [master.Master] DEBUG: migration 8;21a044;21803d:
> a2438.halxg.cloudera.com:10011[544d5edf1fec529] ->
> a2422.halxg.cloudera.com:10011[3451dd2d9fa6761]
> 2014-04-17 09:47:36,223 [master.Master] DEBUG: migration 8;59c02e;59a02b:
> a2416.halxg.cloudera.com:10011[244d5edf0b0c4ff] ->
> a2414.halxg.cloudera.com:10011[444d5f6b43ac4aa]
> {noformat}
> Later messages show these tablets being unloaded successfully. However, since
> the table is offline they never get loaded on the new tablet server. This
> means they never leave the queue, so balancing stops.
> As an added complication, this last set of migrations was added after the
> table was already offline. I think this is because there had been unhosted
> tablets which caused a bunch of contention around when balancing would
> finally happen.
> A few needed changes:
> # If the balancer isn't going to balance it needs a log message saying so.
> Ideally, this message should also include information about the outstanding
> migrations that are blocking it.
> # The migration cleanup thread should look for migrations involving offline
> tables and clear them. (I'd prefer this to trying to have the balancer figure
> out if a table is offline or online.)
> # When we offline a table, we should probably clear migrations related to
> that table. This isn't strictly necessary if the cleanup thread will get them
> eventually, but it would speed things up.
> Workarounds:
> # Migration state is stored only in Master memory, so failing over to a
> different master will force a recalculation, which will not include offline
> tables.
> # If for some reason you can't handle a failure of the current master,
> bringing the involved table back online (which might mean all offline tables)
> will allow migrations to resume. The table must remain online until there are
> no longer migrations involving it.
> # I *think* that if you clone the offline table and then delete the original,
> that will clear the outstanding migrations related to it. I did not test
> this, because the above two options are much better.
> The latter option will cause considerably more churn, especially if the
> offline table isn't actually providing utility.
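
The first two proposed changes in the description (a log message when the balancer declines to run, and a cleanup pass that drops migrations for offline tables) can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the class and method names are hypothetical, migrations are modeled as a plain map from an extent string to a destination server, and the table id is assumed to be the first `;`-separated field of the extent, as in the log lines quoted above (for example `8;4aa09c;4a809d` for table 8).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names come from the Accumulo code base.
class MigrationCleanupSketch {

    // Assumption: an extent prints as "tableId;endRow;prevEndRow", as in the log above.
    static String tableIdOf(String extent) {
        int semi = extent.indexOf(';');
        return semi < 0 ? extent : extent.substring(0, semi);
    }

    // Returns a message naming the migrations that block balancing,
    // or null when nothing is outstanding and balancing may proceed.
    static String describeBlockingMigrations(Map<String, String> migrations) {
        if (migrations.isEmpty()) {
            return null;
        }
        return "not balancing: " + migrations.size()
                + " outstanding migration(s): " + migrations.keySet();
    }

    // Removes every migration whose table is not online and returns the
    // extents that were cleared, so the caller can log them.
    static List<String> clearOfflineMigrations(Map<String, String> migrations,
                                               Set<String> onlineTables) {
        List<String> cleared = new ArrayList<>();
        Iterator<Map.Entry<String, String>> it = migrations.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            if (!onlineTables.contains(tableIdOf(entry.getKey()))) {
                cleared.add(entry.getKey());
                it.remove(); // safe removal while iterating
            }
        }
        return cleared;
    }

    public static void main(String[] args) {
        Map<String, String> migrations = new LinkedHashMap<>();
        // Extent taken from the log above; table 8 is offline in this scenario.
        migrations.put("8;4aa09c;4a809d", "a2422.halxg.cloudera.com:10011");
        // Made-up extent for a hypothetical online table 9.
        migrations.put("9;m;g", "a2414.halxg.cloudera.com:10011");

        Set<String> onlineTables = new HashSet<>(Arrays.asList("9"));

        System.out.println(describeBlockingMigrations(migrations));
        System.out.println("cleared: " + clearOfflineMigrations(migrations, onlineTables));
        System.out.println("remaining: " + migrations.keySet());
    }
}
```

Keeping the offline check in the cleanup pass, rather than in the balancer, matches the preference stated in the description: the balancer only needs to know whether any migrations are outstanding, not why.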