[ https://issues.apache.org/jira/browse/ACCUMULO-327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195120#comment-13195120 ]

Keith Turner commented on ACCUMULO-327:
---------------------------------------

We have a possible cause for this problem.  The master keeps a single connection 
to each tablet server, and that connection is protected by a lock.  A merge 
operation asked tablet server X to split a tablet, and the split operation held 
the connection lock while waiting for a !METADATA tablet to load.  The master 
then asked tablet server X to load the very metadata tablet the split was 
waiting on, but that request blocked on the lock held by the split operation.  
The result is a deadlock, which snowballs and causes the master to kill all the 
tablet servers because it thinks they are unresponsive.
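
To make the interleaving concrete, here is a minimal, self-contained Java sketch 
of that lock ordering.  The class, field, and thread names are made up for 
illustration and are not the actual master code: one thread holds the per-server 
connection lock while waiting for the metadata tablet to load, while the thread 
that would trigger that load blocks on the same lock.

{noformat}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the suspected deadlock, not the real Accumulo code.
public class ConnectionLockDeadlock {

    // Guards the one connection the master keeps to tablet server X.
    private static final ReentrantLock connectionLock = new ReentrantLock();

    // Signaled once the metadata tablet has been loaded on tablet server X.
    private static final CountDownLatch metadataTabletLoaded = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        // Thread 1: merge asks tablet server X to split a tablet.  The request
        // holds the connection lock and then waits for the metadata tablet.
        Thread splitRequest = new Thread(() -> {
            connectionLock.lock();
            try {
                System.out.println("split: holding connection lock, waiting for !METADATA tablet");
                metadataTabletLoaded.await();          // never satisfied
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                connectionLock.unlock();
            }
        }, "split-request");

        // Thread 2: the master tries to tell tablet server X to load the
        // metadata tablet, but must take the same connection lock first.
        Thread loadRequest = new Thread(() -> {
            System.out.println("load: waiting for connection lock to send load request");
            connectionLock.lock();                     // blocks forever
            try {
                metadataTabletLoaded.countDown();      // never reached
            } finally {
                connectionLock.unlock();
            }
        }, "load-request");

        // Daemon threads so the demo JVM can still exit after main returns.
        splitRequest.setDaemon(true);
        loadRequest.setDaemon(true);

        splitRequest.start();
        Thread.sleep(100);   // let the split grab the lock first
        loadRequest.start();

        splitRequest.join(2000);
        loadRequest.join(2000);
        System.out.println("both threads still blocked -> deadlock; the master "
            + "eventually times out and kills the tablet server");
    }
}
{noformat}

In this sketch neither thread ever finishes: the split cannot proceed until the 
metadata tablet loads, and the load request cannot be sent until the split 
releases the connection lock.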
                
> master lost all tablet servers
> ------------------------------
>
>                 Key: ACCUMULO-327
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-327
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>         Environment: running the random walk test on a small cluster
>            Reporter: Eric Newton
>            Assignee: Keith Turner
>
> The master would occasionally take a long time to collect status information 
> from a tablet server.  The connection would time out after the default 
> 120-second RPC timeout.  This probably left the connection in a bad state, 
> because I am seeing
> {noformat}
> org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 
> but got 0
>         at 
> org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:445)
>         at 
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_halt(TabletClientService.java:893)
>         at 
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.halt(TabletClientService.java:876)
> {noformat}
> If the master is unable to collect statistics from a tablet server, it 
> attempts to halt it (as above) and then removes the tablet server's lock in 
> ZooKeeper.
> Eventually, under the pressure of random walk operations, the master killed 
> every tablet server.
> Guess: a lock in the tablet server is delaying status reporting.
> I wrote a script to process the master logs.  It saves each line that refers 
> to the IP address of a tablet server.  When it sees that the ZooKeeper lock 
> has been deleted, it prints the last N lines that refer to that tablet server.
> In 7 out of the 10 cases, a split timed out prior to or during the status 
> request failures.
> In 5 cases, the tablet server was hosting the root tablet (a necessary 
> condition when the last server died).
> In 5 cases, the table_table info tablet was being hosted.
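
For reference, a minimal sketch of the kind of log scan described in the issue 
above.  The address and "lock deleted" patterns and the window size N are 
assumptions about the master log format, not the actual messages:

{noformat}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Keep the last N master log lines that mention each tablet server address,
// and dump that window when the log reports the server's ZooKeeper lock was
// deleted.  Regular expressions are guesses at the log format.
public class TServerLockLogScan {

    private static final int WINDOW = 20;  // N: lines of context kept per server
    private static final Pattern ADDRESS =
        Pattern.compile("(\\d{1,3}(?:\\.\\d{1,3}){3}:\\d+)");
    private static final Pattern LOCK_DELETED =
        Pattern.compile("lock (?:was )?deleted", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws IOException {
        Map<String, Deque<String>> lastLines = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = ADDRESS.matcher(line);
                while (m.find()) {
                    String server = m.group(1);
                    Deque<String> window =
                        lastLines.computeIfAbsent(server, k -> new ArrayDeque<>());
                    window.addLast(line);
                    if (window.size() > WINDOW)
                        window.removeFirst();
                    // On a lock deletion, print the saved context for that server.
                    if (LOCK_DELETED.matcher(line).find()) {
                        System.out.println("=== lock deleted for " + server + " ===");
                        window.forEach(System.out::println);
                        System.out.println();
                    }
                }
            }
        }
    }
}
{noformat}

Run as, for example, {noformat}java TServerLockLogScan < master.log{noformat}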


        
