[ 
https://issues.apache.org/jira/browse/ACCUMULO-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084733#comment-14084733
 ] 

Sean Busbey commented on ACCUMULO-2694:
---------------------------------------

At least for now, the [1.6 job has its logs|https://builds.apache.org/view/All/job/Accumulo-1.6/ws/minicluster/target/mini-tests/org.apache.accumulo.minicluster.impl.MiniAccumuloClusterImplTest/logs/]
from the failure. Unfortunately, they appear to be useless because they were 
logged at INFO level.

> Offline tables block balancing for online tables
> ------------------------------------------------
>
>                 Key: ACCUMULO-2694
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2694
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.0, 1.5.0, 1.6.0
>         Environment: 1.6.0-RC2. Started CI with a 10-tablet pre-split table. 
>            Reporter: Mike Drob
>            Assignee: Sean Busbey
>            Priority: Critical
>              Labels: 16_qa_bug
>             Fix For: 1.5.2, 1.6.1, 1.7.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Neither DefaultLoadBalancer nor ChaoticLoadBalancer will balance while there 
> are outstanding migrations.
> In this instance, we have offline tables from previous CI runs. One of these 
> tables had outstanding migrations:
> {noformat}
> 2014-04-17 09:47:34,716 [balancer.TabletBalancer] DEBUG: Scanning tablet 
> server a2438.halxg.cloudera.com:10011[544d5edf1fec529] for table 8
> 2014-04-17 09:47:36,217 [balancer.TabletBalancer] DEBUG: Scanning tablet 
> server a2416.halxg.cloudera.com:10011[244d5edf0b0c4ff] for table 8
> 2014-04-17 09:47:36,222 [balancer.DefaultLoadBalancer] DEBUG: balance ended 
> with 4 migrations
> 2014-04-17 09:47:36,222 [balancer.DefaultLoadBalancer] DEBUG: balance ended 
> with 0 migrations
> 2014-04-17 09:47:36,222 [master.Master] DEBUG: migration 8;4aa09c;4a809d: 
> a2438.halxg.cloudera.com:10011[544d5edf1fec529] -> 
> a2422.halxg.cloudera.com:10011[3451dd2d9fa6761]
> 2014-04-17 09:47:36,222 [master.Master] DEBUG: migration 8;7e603;7e4029: 
> a2438.halxg.cloudera.com:10011[544d5edf1fec529] -> 
> a2422.halxg.cloudera.com:10011[3451dd2d9fa6761]
> 2014-04-17 09:47:36,222 [master.Master] DEBUG: migration 8;21a044;21803d: 
> a2438.halxg.cloudera.com:10011[544d5edf1fec529] -> 
> a2422.halxg.cloudera.com:10011[3451dd2d9fa6761]
> 2014-04-17 09:47:36,223 [master.Master] DEBUG: migration 8;59c02e;59a02b: 
> a2416.halxg.cloudera.com:10011[244d5edf0b0c4ff] -> 
> a2414.halxg.cloudera.com:10011[444d5f6b43ac4aa]
> {noformat}
> Later messages show these tablets being unloaded successfully. However, since 
> the table is offline they never get loaded on the new tablet server. This 
> means they never leave the queue, so balancing stops.
> As an added complication, this last set of migrations was added after the 
> table was already offline. I think this is because there had been unhosted 
> tablets which caused a bunch of contention around when balancing would 
> finally happen.
> A few needed changes:
> # If the balancer isn't going to balance it needs a log message saying so. 
> Ideally, this message should also include information about the outstanding 
> migrations that are blocking it.
> # The migration cleanup thread should look for migrations involving offline 
> tables and clear them. (I'd prefer this to having the balancer figure out 
> whether a table is offline or online.)
> # When we offline a table, we should probably clear migrations related to 
> that table. This isn't strictly necessary if the cleanup thread will get them 
> eventually, but it would speed things up.
> Workarounds:
> # Migration state is stored only in Master memory, so failing over to a 
> different master forces a recalculation that will not include offline 
> tables.
> # If for some reason you can't handle a failure of the current master, 
> bringing the involved table back online (which might mean all offline tables) 
> will allow migrations to resume. The table must remain online until no 
> migrations involve it.
> # I *think* that if you clone the offline table and then delete the original, 
> that will clear the outstanding migrations related to it. I did not test 
> this, because the above two options are much better.
> The latter option will cause considerably more churn, especially if the 
> offline table isn't actually providing utility.
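
For illustration only, the first two proposed changes in the description above 
(a log message when balancing is skipped, and a cleanup pass that clears 
migrations for offline tables) could look roughly like the sketch below. This 
is not Accumulo code: MigrationCleanupSketch and all of its methods are 
hypothetical names, the migration set is modeled as a plain map from tablet 
extent to destination server, and the table id is assumed to be the part of 
the extent string before the first ';', as in the log lines quoted above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; none of these names are real Accumulo classes or methods.
final class MigrationCleanupSketch {

    // Pending migrations: tablet extent (e.g. "8;4aa09c;4a809d") -> destination server.
    // A TreeMap keeps the extents sorted so messages are deterministic.
    private final Map<String, String> migrations = new TreeMap<>();

    void addMigration(String extent, String destination) {
        migrations.put(extent, destination);
    }

    int pending() {
        return migrations.size();
    }

    // Assumption: the table id is the text before the first ';' in the extent.
    static String tableIdOf(String extent) {
        int idx = extent.indexOf(';');
        return idx < 0 ? extent : extent.substring(0, idx);
    }

    // Proposed change 2: drop migrations whose table is offline.
    // Returns the extents that were removed so the caller can log them.
    List<String> clearOfflineMigrations(Set<String> offlineTableIds) {
        List<String> removed = new ArrayList<>();
        Iterator<String> it = migrations.keySet().iterator();
        while (it.hasNext()) {
            String extent = it.next();
            if (offlineTableIds.contains(tableIdOf(extent))) {
                removed.add(extent);
                it.remove();
            }
        }
        return removed;
    }

    // Proposed change 1: if balancing would be skipped, say so and name the blockers.
    // Returns an empty Optional when nothing is outstanding.
    Optional<String> balanceBlockedMessage() {
        if (migrations.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of("not balancing: " + migrations.size()
                + " outstanding migration(s): " + migrations.keySet());
    }

    public static void main(String[] args) {
        MigrationCleanupSketch sketch = new MigrationCleanupSketch();
        sketch.addMigration("8;4aa09c;4a809d", "a2422.halxg.cloudera.com:10011");
        sketch.balanceBlockedMessage().ifPresent(System.out::println);
        // Table 8 is offline, so its migration is cleared and balancing can resume.
        System.out.println("cleared: " + sketch.clearOfflineMigrations(Collections.singleton("8")));
        System.out.println("blocked: " + sketch.balanceBlockedMessage().isPresent());
    }
}
```

Keeping the offline check in the cleanup pass rather than in the balancer 
matches the preference stated in the description; the third proposed change 
(clearing migrations at the moment a table is taken offline) would amount to 
calling the same cleanup with just that table's id.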



--
This message was sent by Atlassian JIRA
(v6.2#6252)
