DomGarguilo commented on issue #1791:
URL: https://github.com/apache/accumulo/issues/1791#issuecomment-767621584


   > Hmm, based on where it's failing, it seems like it's possible that not all 
tablets are yet marked as suspended. The loop condition on line 213 only waits 
until there's _some_ suspended tablets for at least two tservers that were 
killed. However, it doesn't loop until _all_ tablets hosted on those tservers 
are marked as suspended, which is the check on line 225 that is failing.
   
   It does seem like a good idea to wait until _all_ of the suspended tablets 
are marked as suspended before the assert on 225. I'm not sure that this is 
contributing to the test failure however, because the assert on line 225 
compares the tablet locations before and after the tservers are killed to make 
sure that the location of the suspended tablets are the same as their hosted 
location before server death. The problem is that some of the locations of the 
suspended tablets are unexpected, that is, the suspended locations are on a 
different server than their hosted locations. Almost all of the time, all of 
the tablets are accounted for, they are just not in the expected locations.
   
   > The comment above line 225 says "All suspended tablets should "belong" to 
the dead tablet servers". However, it's not actually checking for that... what 
it's checking for is that "all tablets belonging to the dead tablet servers are 
now suspended"... but that's not a condition that we waited for in the loop at 
line 213.
   
   From what I can tell, the tablet locations of all servers are recorded, two 
servers are killed, then the suspended tablets on those dead servers are 
compared to see if they are in the same location as before the servers were 
killed.
   
   > We can probably wait for all tablets assigned to the dead tserver to be 
marked as suspended OR we can relax the check on line 225 to only verify what 
is in the comment, and not assume that all of them that had been assigned to 
that tserver have been suspended yet.
   
   I think its a good idea to wait for all tablets to be marked as suspended 
and from what I can tell, the assert does attempt to verify what is claimed in 
the comment. At the point of the assert, there are generally no tablets that 
are not suspended.
   
   > This is just a hunch, though, that not all the tablets have yet been 
marked as suspended. You could verify this hunch by checking the state of the 
other tablets that were assigned to the dead tserver when the equality check on 
line 225 would have failed (something like replacing `assertEquals` with `if 
not equals, then dump ds.hosted and fail()`).
   
   I have tried doing something like this and posted the results in an [above 
comment](https://github.com/apache/accumulo/issues/1791#issuecomment-763099727).
 It shows the migration of tablets between the two killed servers sometime 
between their death and the measurement of the suspended tablet locations.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to