Alexey Serbin created KUDU-2679:
-----------------------------------
Summary: In some scenarios, a Spark Kudu application can be devoid
of fresh authn tokens
Key: KUDU-2679
URL: https://issues.apache.org/jira/browse/KUDU-2679
Project: Kudu
Issue Type: Bug
Components: client, security, spark
Affects Versions: 1.7.1, 1.8.0, 1.7.0, 1.6.0, 1.5.0, 1.4.0, 1.3.1, 1.3.0
Reporter: Alexey Serbin
When running in {{cluster}} mode, tasks of a Spark Kudu client application can
be left without new (i.e. non-expired) authentication tokens even if the tasks
themselves run for a very short time. Essentially, if the driver runs longer
than the authn token expiration interval and has a particular pattern of making
RPC calls to Kudu masters and tablet servers, all tasks scheduled to run after
the authn token expiration interval are supplied with expired authn tokens, so
every such task fails. The only workaround is to restart the application or to
drop the long-established connections from the driver to Kudu masters/tservers.
Below are some details explaining why that can happen.
Let's assume the following holds true for a Spark Kudu application:
* The application is running against a secured Kudu cluster.
* The application is running in the {{cluster}} mode.
* There are no primary authentication credentials on the machines for the user
under which the Spark executors are running (i.e. {{kinit}} hasn't been run on
those executor machines for the corresponding user, or the Kerberos credentials
have already expired there).
* The {{--authn_token_validity_seconds}} masters' flag is set to {{X}} seconds
(default is 60 * 60 * 24 * 7 seconds, i.e. 7 days).
* The {{--rpc_default_keepalive_time_ms}} flag for masters (and tablet servers,
if they are involved in the communication between the driver process and the
Kudu backend) is set to {{Y}} milliseconds (default is 65000 ms).
* The application is running for longer than {{X}} seconds.
* The driver process makes requests to Kudu masters at least every {{Y}}
milliseconds.
* The driver either doesn't make requests to Kudu tablet servers or makes such
requests at least every {{Y}} milliseconds to each of the involved tablet
servers.
* The executors run tasks that keep connections to tablet servers idle for
longer than {{Y}} milliseconds, or the driver spawns tasks at an executor more
than {{Y}} milliseconds after the last task completed on that executor.
Essentially, this describes a Spark Kudu application where the driver process
keeps its already opened connections active while the executors need to open
new connections to Kudu tablet servers (and/or masters). Also, the executor
machines don't have Kerberos credentials for the OS user under which the
executor processes run.
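The connection pattern above can be illustrated with a small model. This is a
sketch only, not Kudu or Spark code: the function name and the request
timelines are hypothetical, and it assumes that a connection is closed once it
has been idle for longer than {{Y}} milliseconds.

```python
def connection_stays_open(request_times_ms, keepalive_ms=65_000):
    """Return True if no idle gap between consecutive requests exceeds keepalive_ms.

    Toy model of --rpc_default_keepalive_time_ms: a connection is assumed to be
    closed by the server once it has been idle for longer than the keepalive.
    """
    ordered = sorted(request_times_ms)
    gaps = (later - earlier for earlier, later in zip(ordered, ordered[1:]))
    return all(gap <= keepalive_ms for gap in gaps)


# Hypothetical driver: one request every 30 s over 10 minutes.
driver_requests = list(range(0, 600_001, 30_000))
# Hypothetical executor: two tasks separated by a 2-minute pause.
executor_requests = [0, 10_000, 130_000]

driver_keeps_connection = connection_stays_open(driver_requests)
executor_keeps_connection = connection_stays_open(executor_requests)
```

In this model the driver's connection is never dropped, so the driver never has
to establish a new one, while the executor has to reconnect for its second task.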
In such scenarios, the application's tasks spawned after {{X}} seconds from the
application start will fail because of expired authentication tokens, while the
driver process will never re-acquire its authn token, keeping the expired token
in {{KuduContext}} forever.
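The resulting failure window can be sketched the same way. Again, this is a toy
model under stated assumptions rather than the actual client logic: the
function names are hypothetical, the token is assumed to be issued once at
application start and never refreshed by the driver, and an executor with valid
Kerberos credentials is assumed to be able to authenticate regardless of the
token.

```python
# Default of --authn_token_validity_seconds quoted above: 7 days.
DEFAULT_TOKEN_VALIDITY_S = 60 * 60 * 24 * 7


def token_is_expired(now_s, issued_at_s, validity_s=DEFAULT_TOKEN_VALIDITY_S):
    """Toy check: the token is expired once its validity interval has elapsed."""
    return now_s >= issued_at_s + validity_s


def task_can_authenticate(task_start_s, token_issued_at_s, executor_has_kerberos,
                          validity_s=DEFAULT_TOKEN_VALIDITY_S):
    """Model whether a task opening a new connection can authenticate.

    Assumption: without primary (Kerberos) credentials on the executor, the
    task depends entirely on the token handed over by the driver.
    """
    if executor_has_kerberos:
        return True
    return not token_is_expired(task_start_s, token_issued_at_s, validity_s)


# The driver acquired its token at application start (t = 0) and never refreshed it.
early_task_ok = task_can_authenticate(3_600, 0, executor_has_kerberos=False)
late_task_ok = task_can_authenticate(DEFAULT_TOKEN_VALIDITY_S + 1, 0,
                                     executor_has_kerberos=False)
```

Here a task started one hour into the run authenticates, while a task started
just after {{X}} seconds does not, however short the task itself is.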