[jira] [Created] (SAMZA-376) ApplicationMaster Timeout after LeaderNotAvailableException

JIRA Fri, 08 Aug 2014 16:39:06 -0700

Nicolas Bär created SAMZA-376:
---------------------------------

             Summary: ApplicationMaster Timeout after LeaderNotAvailableException
                 Key: SAMZA-376
                 URL: https://issues.apache.org/jira/browse/SAMZA-376
             Project: Samza
          Issue Type: Bug
    Affects Versions: 0.7.0
            Reporter: Nicolas Bär
            Priority: Minor



The application master does not send a heartbeat to the resource manager while 
the leader of an input topic is unavailable. It retries until a leader is 
available and only then starts sending heartbeats. If the Kafka cluster is busy 
during this time, leader election can take long enough that the resource 
manager's timeout is reached, which results in a shutdown of the application 
master.
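
To make the timing argument concrete, here is a minimal sketch of the failure condition. It is not Samza or YARN code: the function name, the parameters and the numbers in the usage note are all hypothetical, and the retry loop only simulates elapsed time instead of talking to Kafka.

```python
def am_survives(leader_ready_at_s, rm_expiry_s, retry_backoff_s):
    """Return True if the first heartbeat goes out before the RM gives up.

    Hypothetical model: the AM polls for the topic leader every
    retry_backoff_s seconds and can only start heartbeating once a
    leader has been found.
    """
    elapsed = 0.0
    # Each pass stands in for one failed attempt (LeaderNotAvailableException).
    while elapsed < leader_ready_at_s:
        elapsed += retry_backoff_s
    # The first heartbeat is only possible after the retry loop exits.
    return elapsed <= rm_expiry_s
```

With made-up values, `am_survives(5, 600, 1)` is true (a leader appears long before the resource manager's timeout), while `am_survives(900, 600, 10)` is false: the election outlasts the timeout and the application master is expired before it ever heartbeats.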

I hit this issue on our testbed and received a few follow-up error messages 
after the application master was restarted: 
{quote}
ERROR security.UserGroupInformation: PriviledgedActionException as:baer 
(auth:SIMPLE) 
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 Password not found for ApplicationAttempt appattempt_1407522131931_0001_000001
{quote}
I will investigate this further, but I assume it is better suited to the YARN 
mailing list.

Here is the relevant part from our discussion on IRC (criccomini):
{quote}
SamzaAppMaster
you'll see:       amClient.start
and later,       amClient.stop
the start is starting the YARN AMClient's heartbeat
now
SamzaAppMasterTaskManager
calls assignContainerToSSPTaskNames
in Util
which calls Util.getInputStreamPartitions(config)
and THAT is where Kafka is called
so basically
before amClient.start is called
that getInputStreamPartitions method is invoked
which will block on metadata timeouts
until it can get the data it needs
so SamzaAppMaster is constructing SamzaAppMasterTaskManager before it calls 
amClient.start
{quote}
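
The ordering problem described in the IRC log can be reproduced with a small stand-alone sketch. Everything in it is a hypothetical stand-in: `FakeAmClient` only counts heartbeats and is not the YARN client, `fetch_partitions` just blocks on an event in place of the Kafka metadata call, and the delays are arbitrary. It is meant to illustrate why starting the heartbeat before the blocking lookup could help, not to propose the actual patch.

```python
import threading


class FakeAmClient:
    """Hypothetical stand-in for the YARN AM client; it only counts heartbeats."""

    def __init__(self):
        self.heartbeats = 0
        self._stop = threading.Event()
        self._thread = None

    def start(self, interval_s=0.01):
        def beat():
            # Heartbeat loop: runs in the background until stop() is called.
            while not self._stop.is_set():
                self.heartbeats += 1
                self._stop.wait(interval_s)

        self._thread = threading.Thread(target=beat, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()


def fetch_partitions(leader_ready, poll_s=0.01):
    """Stand-in for the blocking metadata lookup: waits until a leader exists."""
    while not leader_ready.wait(poll_s):
        pass  # each timed-out wait stands in for one failed metadata request
    return ["partition-0"]


def run(start_heartbeat_first, election_delay_s=0.1):
    """Return how many heartbeats were sent while the lookup was blocked."""
    client = FakeAmClient()
    leader_ready = threading.Event()
    # Simulate a leader election that completes after election_delay_s.
    threading.Timer(election_delay_s, leader_ready.set).start()

    if start_heartbeat_first:
        client.start()
    fetch_partitions(leader_ready)  # blocks for roughly election_delay_s
    beats_during_lookup = client.heartbeats
    if not start_heartbeat_first:
        client.start()  # ordering described in the log: heartbeat starts last
    client.stop()
    return beats_during_lookup
```

`run(False)` mirrors the ordering in the log and reports zero heartbeats during the lookup, whereas `run(True)` reports at least one because the heartbeat thread is already running while the lookup blocks.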



--
This message was sent by Atlassian JIRA
(v6.2#6252)
