Nicolas Bär created SAMZA-376:
---------------------------------
Summary: ApplicationMaster Timeout after LeaderNotAvailableException
Key: SAMZA-376
URL: https://issues.apache.org/jira/browse/SAMZA-376
Project: Samza
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Nicolas Bär
Priority: Minor
The application master does not send a heartbeat to the resource manager while
the leader of the topic is unavailable. It retries until the leader is
available and only then sends the heartbeat. If the Kafka cluster is busy during
this time, the leader election can take long enough that the timeout is reached,
resulting in a shutdown of the application master.
I hit this issue on our testbed and received a few follow-up error messages
after the application master was restarted:
{quote}
ERROR security.UserGroupInformation: PriviledgedActionException as:baer
(auth:SIMPLE)
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
Password not found for ApplicationAttempt appattempt_1407522131931_0001_000001
{quote}
I will investigate this further, but I assume it is better suited to the YARN
mailing list.
Here is the relevant part from our discussion on IRC (criccomini):
{quote}
SamzaAppMaster
you'll see: amClient.start
and later, amClient.stop
the start is starting the YARN AMClient's heartbeat
now
SamzaAppMasterTaskManager
calls assignContainerToSSPTaskNames
in Util
which calls Util.getInputStreamPartitions(config)
and THAT is where Kafka is called
so basically
before amClient.start is called
that getInputStreamPartitions method is invoked
which will block on metadata timeouts
until it can get the data it needs
so SamzaAppMaster is constructing SamzaAppMasterTaskManager before it calls
amClient.start
{quote}
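To make the ordering from the IRC log concrete, here is a minimal sketch in Java. It is a simplified model, not the Samza or YARN code: FakeAmClient, AmStartupOrder and both *Order methods are hypothetical names, and the stand-in for Util.getInputStreamPartitions only records that it was called instead of talking to Kafka. The second method shows one possible reordering (heartbeat first), which is an assumption on my side and not necessarily the right fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the startup ordering; none of this is Samza or YARN code.
class AmStartupOrder {

    // Hypothetical stand-in for the YARN AMClient: start() begins the heartbeat.
    static class FakeAmClient {
        private final List<String> events;

        FakeAmClient(List<String> events) {
            this.events = events;
        }

        void start() {
            events.add("amClient.start");
        }

        void stop() {
            events.add("amClient.stop");
        }
    }

    // Hypothetical stand-in for Util.getInputStreamPartitions(config).
    // In the real job this is the call that can block on Kafka metadata.
    static List<String> getInputStreamPartitions(List<String> events) {
        events.add("getInputStreamPartitions");
        return Arrays.asList("input-0", "input-1");
    }

    // Order described in the IRC log: the Kafka lookup runs before the heartbeat starts.
    static List<String> reportedOrder() {
        List<String> events = new ArrayList<>();
        FakeAmClient amClient = new FakeAmClient(events);
        getInputStreamPartitions(events); // may block; no heartbeat is running yet
        amClient.start();
        amClient.stop();
        return events;
    }

    // Assumed alternative: start the heartbeat first, then do the Kafka lookup.
    static List<String> heartbeatFirstOrder() {
        List<String> events = new ArrayList<>();
        FakeAmClient amClient = new FakeAmClient(events);
        amClient.start();
        getInputStreamPartitions(events); // may still block, but heartbeats continue
        amClient.stop();
        return events;
    }

    public static void main(String[] args) {
        System.out.println("reported:        " + reportedOrder());
        System.out.println("heartbeat first: " + heartbeatFirstOrder());
    }
}
```

Running main prints the two event orders. In the reported order, the metadata lookup is the first event, so any time spent waiting for a leader counts against the resource manager's timeout before a single heartbeat has been sent.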