Sean Broeder created TRAFODION-2236:

             Summary: TM crashes following sqstart
                 Key: TRAFODION-2236
             Project: Apache Trafodion
          Issue Type: Bug
          Components: dtm
    Affects Versions: 2.0-incubating
            Reporter: Sean Broeder
            Assignee: Sean Broeder
             Fix For: 2.1-incubating

If Trafodion is stopped abruptly while a region server has recovery requests 
posted in ZooKeeper, the new TMs may be unable to start.  This happens 
because the TM recovery thread reads the ZK entries and attempts to send the 
recovery resolution to the region server that posted each entry.  It gets a 
connection error because that region server no longer exists.

The partial solution is to remove the ZK entries as part of startup so the TM 
can start up without error.

This is safe to do because any region server needing recovery will repost to 
ZooKeeper, and the TM will have no issues connecting to that RS.
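
The startup cleanup described above could look roughly like the following 
Java sketch.  It is only an illustration and not the actual TM code: 
RecoveryStore, InMemoryStore and clearStaleEntries are hypothetical names, and 
an in-memory map stands in for the ZooKeeper node so the example runs on its 
own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StaleRecoveryCleanup {
    /** Hypothetical stand-in for the ZooKeeper node holding recovery requests. */
    interface RecoveryStore {
        List<String> listEntries();
        void deleteEntry(String name);
    }

    /** In-memory store, present only to make the sketch runnable. */
    static class InMemoryStore implements RecoveryStore {
        final Map<String, String> entries = new TreeMap<>();

        public List<String> listEntries() {
            // Return a copy so entries can be deleted while iterating.
            return new ArrayList<>(entries.keySet());
        }

        public void deleteEntry(String name) {
            entries.remove(name);
        }
    }

    /** Removes every entry left over from the previous run; returns the count removed. */
    static int clearStaleEntries(RecoveryStore store) {
        int removed = 0;
        for (String name : store.listEntries()) {
            store.deleteEntry(name);
            removed++;
        }
        return removed;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.entries.put("rs-old-1", "txn 101");
        store.entries.put("rs-old-2", "txn 102");
        int removed = clearStaleEntries(store);
        System.out.println("removed " + removed + ", remaining " + store.entries.size());
    }
}
```

Running the cleanup before the recovery thread starts means the thread only 
ever sees entries reposted by region servers that are currently alive.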

An additional fix will be made to the TM to handle exceptions raised while 
communicating with region servers during recovery.
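
As a rough idea of what that additional fix might involve, the Java sketch 
below catches the connection error per entry and moves on instead of letting 
it end the recovery pass.  RegionServerClient, sendResolution and resolveAll 
are hypothetical names chosen for illustration; the real TM interfaces are 
not shown in this issue.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TolerantRecovery {
    /** Hypothetical stand-in for the call that delivers a recovery resolution. */
    interface RegionServerClient {
        void sendResolution(String regionServer, String txn) throws IOException;
    }

    /** Tries every entry; returns the region servers that could not be reached. */
    static List<String> resolveAll(Map<String, String> entries, RegionServerClient client) {
        List<String> unreachable = new ArrayList<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            try {
                client.sendResolution(e.getKey(), e.getValue());
            } catch (IOException ex) {
                // Record the failure and continue with the remaining entries.
                unreachable.add(e.getKey());
            }
        }
        return unreachable;
    }

    public static void main(String[] args) {
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("rs-gone", "txn 101");
        entries.put("rs-live", "txn 102");

        // Simulated client: the first region server no longer exists.
        RegionServerClient client = (rs, txn) -> {
            if (rs.equals("rs-gone")) {
                throw new IOException("connection refused");
            }
        };

        System.out.println("unreachable: " + resolveAll(entries, client));
    }
}
```

What to do with the unreachable list (retry, log, or drop) is left open here, 
since the issue does not say how the TM will handle it.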

This message was sent by Atlassian JIRA
