[ https://issues.apache.org/jira/browse/ACCUMULO-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084733#comment-14084733 ]
Sean Busbey commented on ACCUMULO-2694:
---------------------------------------

At least for now, the [1.6 job has its logs|https://builds.apache.org/view/All/job/Accumulo-1.6/ws/minicluster/target/mini-tests/org.apache.accumulo.minicluster.impl.MiniAccumuloClusterImplTest/logs/] from the failure. Unfortunately, they appear to be useless because they're at INFO.

> Offline tables block balancing for online tables
> ------------------------------------------------
>
>                 Key: ACCUMULO-2694
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2694
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.0, 1.5.0, 1.6.0
>         Environment: 1.6.0-RC2 Started CI with a 10-tablet pre-split table.
>            Reporter: Mike Drob
>            Assignee: Sean Busbey
>            Priority: Critical
>              Labels: 16_qa_bug
>             Fix For: 1.5.2, 1.6.1, 1.7.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Both DefaultLoadBalancer and ChaoticLoadBalancer won't balance if there are outstanding migrations.
> In this instance, we have offline tables from previous CI runs.
> One of these tables had outstanding migrations:
> {noformat}
> 2014-04-17 09:47:34,716 [balancer.TabletBalancer] DEBUG: Scanning tablet server a2438.halxg.cloudera.com:10011[544d5edf1fec529] for table 8
> 2014-04-17 09:47:36,217 [balancer.TabletBalancer] DEBUG: Scanning tablet server a2416.halxg.cloudera.com:10011[244d5edf0b0c4ff] for table 8
> 2014-04-17 09:47:36,222 [balancer.DefaultLoadBalancer] DEBUG: balance ended with 4 migrations
> 2014-04-17 09:47:36,222 [balancer.DefaultLoadBalancer] DEBUG: balance ended with 0 migrations
> 2014-04-17 09:47:36,222 [master.Master] DEBUG: migration 8;4aa09c;4a809d: a2438.halxg.cloudera.com:10011[544d5edf1fec529] -> a2422.halxg.cloudera.com:10011[3451dd2d9fa6761]
> 2014-04-17 09:47:36,222 [master.Master] DEBUG: migration 8;7e603;7e4029: a2438.halxg.cloudera.com:10011[544d5edf1fec529] -> a2422.halxg.cloudera.com:10011[3451dd2d9fa6761]
> 2014-04-17 09:47:36,222 [master.Master] DEBUG: migration 8;21a044;21803d: a2438.halxg.cloudera.com:10011[544d5edf1fec529] -> a2422.halxg.cloudera.com:10011[3451dd2d9fa6761]
> 2014-04-17 09:47:36,223 [master.Master] DEBUG: migration 8;59c02e;59a02b: a2416.halxg.cloudera.com:10011[244d5edf0b0c4ff] -> a2414.halxg.cloudera.com:10011[444d5f6b43ac4aa]
> {noformat}
> Later messages show these tablets being unloaded successfully. However, since the table is offline, they never get loaded on the new tablet server. This means they never leave the queue, so balancing stops.
> As an added complication, this last set of migrations was added after the table was already offline. I think this is because there had been unhosted tablets, which caused a bunch of contention around when balancing would finally happen.
> A few needed changes:
> # If the balancer isn't going to balance, it needs a log message saying so. Ideally, this message should also include information about the outstanding migrations that are blocking it.
> # The migration cleanup thread should look for migrations involving offline tables and clear them. (I'd prefer this to trying to have the balancer figure out if a table is offline or online.)
> # When we offline a table, we should probably clear migrations related to that table. This isn't strictly necessary if the cleanup thread will get them eventually, but it would speed things up.
> Workarounds:
> # Migration state is only stored in Master memory; failing over to a different master will force recalculation, which will not include offline tables.
> # If for some reason you can't handle a failure of the current master, bringing the involved table back online (which might mean all offline tables) will allow migrations to resume. The table must remain online until there are no longer migrations involving it.
> # I *think* that if you clone the offline table and then delete the original, that will clear the outstanding migrations related to it. I did not test this, because the above two options are much better.
> The latter option will cause considerably more churn, especially if the offline table isn't actually providing utility.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
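
The first two needed changes in the quoted description (log why balancing is skipped, and have a cleanup pass drop migrations for offline tables) can be sketched roughly as below. This is only an illustration, not Accumulo code: the class `MigrationCleanupSketch` and all of its methods are hypothetical, migrations are modeled as a plain in-memory map, and the table id is assumed to be the part of the tablet id before the first `;`, as the `migration 8;4aa09c;4a809d` log lines suggest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: none of these names come from the Accumulo code base.
public class MigrationCleanupSketch {

  // Pending migrations held only in memory: tablet id -> destination tablet server.
  private final Map<String, String> migrations = new ConcurrentHashMap<>();

  public void addMigration(String tabletId, String destination) {
    migrations.put(tabletId, destination);
  }

  public int pendingCount() {
    return migrations.size();
  }

  // Assumption: the table id is the text before the first ';' in the tablet id.
  static String tableIdOf(String tabletId) {
    int sep = tabletId.indexOf(';');
    return sep < 0 ? tabletId : tabletId.substring(0, sep);
  }

  // Cleanup pass: drop migrations whose table is offline, since those tablets
  // will never be loaded at the destination and would otherwise stay queued.
  // Returns the removed tablet ids so the caller can log them.
  public List<String> clearOfflineMigrations(Set<String> offlineTableIds) {
    List<String> removed = new ArrayList<>();
    for (String tabletId : migrations.keySet()) {
      if (offlineTableIds.contains(tableIdOf(tabletId))
          && migrations.remove(tabletId) != null) {
        removed.add(tabletId);
      }
    }
    return removed;
  }

  // Balancer gate: refuse to balance while migrations are outstanding,
  // and say so, naming the migrations that block it.
  public boolean shouldBalance() {
    if (!migrations.isEmpty()) {
      System.out.println("Not balancing: " + migrations.size()
          + " outstanding migration(s): " + migrations.keySet());
      return false;
    }
    return true;
  }
}
```

In this sketch a periodic cleanup thread would call `clearOfflineMigrations` with the current set of offline table ids; the third needed change (clearing on offline) would amount to calling the same method with a single table id at the moment that table is taken offline.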