On 2/10/14, 10:33 AM, Bill Havanki wrote:
I used the standard agitation intervals. I don't understand enough about
the system yet to ascertain why tablets stayed unbalanced. One possibility
is the timing of the checks and how that interacted with the 15-minute time
allowance and minimum count:

1. The first failure condition occurred at 11:36, starting the 15-minute
clock.
2. The second failure condition was at the next check 30 minutes later.
3. A rapid succession of checks in the next two minutes pushed the failure
count up high enough.

It's possible that the tablets became balanced, and then unbalanced again,
between steps 1 and 2, so the time allowance was defeated.

Precisely. You could easily have gotten "bad luck" and had some splits right before one of these balance checks, which pushed you out of balance. Diagnosing the "why" here is definitely an annoyance, but it's good to do to make sure you didn't stumble on a bug. Typically, cross-referencing the RW logs against the master log is sufficient to figure out what was happening.
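
The timing hypothesis above can be made concrete with a small model. This is only a sketch, not the randomwalk test's actual code: the `BalanceTracker` class, its `check` method, and the rule that a balanced observation resets both the clock and the count are assumptions made for illustration. The 15-minute allowance and the three-failure minimum are the figures quoted in this thread. Times are in minutes since the first failed check at 11:36.

```python
ALLOWANCE_MINUTES = 15  # time allowance quoted in the thread
MIN_FAILURES = 3        # "three consecutive failures" quoted in the thread


class BalanceTracker:
    """Hypothetical model of the unbalanced check (not Accumulo code)."""

    def __init__(self):
        self.first_failure = None  # minute of the first failure in the current streak
        self.failures = 0          # consecutive failed checks seen so far

    def check(self, minute, unbalanced):
        """Record one check and return True if the test would fail."""
        if not unbalanced:
            # Assumption: seeing a balanced state resets clock and count.
            self.first_failure = None
            self.failures = 0
            return False
        if self.first_failure is None:
            self.first_failure = minute
        self.failures += 1
        elapsed = minute - self.first_failure
        return elapsed > ALLOWANCE_MINUTES and self.failures >= MIN_FAILURES


# Replay the sequence from the list above: a failure at minute 0,
# the next check 30 minutes later, then another check shortly after.
tracker = BalanceTracker()
results = [tracker.check(m, True) for m in (0, 30, 31)]

# Same timeline, but with a check that happens to observe a balanced state.
observed = BalanceTracker()
observed_results = [
    observed.check(0, True),
    observed.check(10, False),  # balanced state is seen, streak resets
    observed.check(30, True),
    observed.check(31, True),
    observed.check(32, True),
]
```

Under these assumptions the first replay fails on its third check, even though nothing was observed during the 30-minute gap. A balanced state that came and went inside that gap would never reach `check`, so it could not reset the clock. In the second replay the balanced state is observed, and three later failures within two minutes do not trip the check.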

Anyway, I restarted the randomwalk and it ran successfully for over 24
hours with agitation.


On Sun, Feb 9, 2014 at 7:25 PM, Josh Elser <[email protected]> wrote:

Interesting - I think I might have run into that once in a whole bunch of RW
runs.

I assume you didn't change the agitation intervals from what's in the
example? The parameters as they stand are, I think, acceptable. Being
unbalanced for that long doesn't seem right. Did you identify why you were
unbalanced?

I'm not sure making that configurable is good either, as you're now skewing
one randomwalk test relative to another (in addition to the variance you
already have from resources available).

Personally, if you run into this, and you can identify that there was a
legitimate reason to be unbalanced across that time and those checks, I'd
be more in favor of just restarting that RW client.


On 2/8/14, 11:50 AM, Bill Havanki wrote:

While running 1.5.1 rc1 through randomwalk I hit a failure in the
Concurrent test due to the tablet servers being "unbalanced". See
ACCUMULO-2198 for some background on the last time I ran into this.

What is the general feeling on dealing with this failure? Is a 15-minute
period too short to wait for balancing, or are three consecutive failures
too few to allow? I'm using only a 7-node cluster with 5 tservers; maybe an
unbalanced condition is more tolerable then?

The parameters defining "unbalanced" aren't configurable at the moment, and
I'm inclined to file a JIRA to make them so, to shepherd the test through,
but I'd love to hear what you think about the importance and proper
parameters for this check.

Thanks,
Bill



