[
https://issues.apache.org/jira/browse/METRON-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212957#comment-16212957
]
ASF GitHub Bot commented on METRON-1266:
----------------------------------------
GitHub user nickwallen opened a pull request:
https://github.com/apache/metron/pull/809
METRON-1266 Profiler - SASL Authentication Failed
When running the Profiler on a cluster that has multiple nodes and is
secured by Kerberos, it was observed that the HBaseBolt was unable to write to
HBase. The Storm worker running the HBaseBolt logged the following exception.
This does not occur all the time and does not occur in all environments.
```
2017-10-19 14:51:00.146 o.a.h.h.i.AbstractRpcClient [ERROR] SASL
authentication failed.
The most likely cause is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed
...
```
## Changes
To fix this, the `topology.auto-credentials` property needs to be set on
the Profiler topology when running in a Kerberized environment. This is
similar to how the other topologies, like Enrichment, are already configured.
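As a rough illustration of the change, here is a minimal sketch of adding the property to a topology's configuration only when the cluster is Kerberized. The function name and the plain-dict representation are hypothetical, and the plugin class shown (Storm's stock `AutoTGT`) is an assumption, since this description does not state which value the Profiler should use.

```python
from typing import Any, Dict, List

# Storm configuration key named in this pull request.
AUTO_CREDENTIALS_KEY = "topology.auto-credentials"

# Assumption: Storm's stock TGT plugin. The PR text does not state the value.
ASSUMED_PLUGINS: List[str] = ["org.apache.storm.security.auth.kerberos.AutoTGT"]


def profiler_topology_config(base: Dict[str, Any], kerberized: bool) -> Dict[str, Any]:
    """Return a copy of `base`, with auto-credentials set when Kerberos is enabled."""
    conf = dict(base)
    if kerberized:
        conf[AUTO_CREDENTIALS_KEY] = list(ASSUMED_PLUGINS)
    return conf
```

In a non-Kerberized environment the key is simply left out, so the topology configuration is unchanged there.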
After finding this, the big mystery for me was why this bug did not cause
failures in all Kerberized environments, all the time. Surely this omission
should break the Profiler when running in any Kerberized environment and so
should have been caught sooner.
The problem is obviously that a ticket cannot be found to authenticate when
attempting to flush profile measurements to HBase. Because of this missing
configuration, the Profiler topology itself is not able to obtain Kerberos
tickets for authentication. However, if the ticket cache on the worker node is
already populated with a valid ticket, then this issue will not occur. The
ticket cache can be populated if another process obtains a ticket or a user
manually runs `kinit` on the same node.
This explains why the problem occurs sporadically and only in some
environments. This issue is less likely to occur in an environment, like Full
Dev, where there are fewer, more active nodes. In this case, it is likely that
some other process or user already pre-populated the ticket cache. In a
larger, multi-node cluster, the ticket cache is less likely to be populated.
That's my working theory at least. Feel free to refute.
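One way to probe this theory on a given worker node is to check whether the ticket cache already holds a valid ticket. The sketch below is illustrative only: it assumes MIT Kerberos `klist`, whose `-s` flag suppresses output and exits with status 0 only when the cache is readable and unexpired, and it assumes the check runs as the same OS user as the Storm worker. The function names are hypothetical.

```python
import subprocess
from typing import Callable, List


def _run_quietly(argv: List[str]) -> int:
    """Run a command, discard its output, and return its exit status (127 if missing)."""
    try:
        return subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        ).returncode
    except FileNotFoundError:
        return 127


def ticket_cache_is_valid(run: Callable[[List[str]], int] = _run_quietly) -> bool:
    """Return True when `klist -s` reports a readable, unexpired ticket cache."""
    # Assumption: MIT Kerberos semantics for `klist -s` (exit status 0 means valid).
    return run(["klist", "-s"]) == 0
```

If this returns `True` on a node where the Profiler works without the fix, that would be consistent with the theory. The `run` parameter exists only so the check can be exercised without a real Kerberos installation.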
## Testing
I tested this by applying the fix in a 12-node Metron cluster. This fixed
the problem and allowed the Profiler to write to HBase. I also tested this in
Full Dev, both in plain vanilla mode and after kerberization.
To test this change, follow these steps.
1. Stand up Full Dev and run the Metron Service Check.
1. Using Ambari, change the profiler duration from 15 minutes to 1 minute.
Then restart the Profiler.
1. Create a simple Profile [following these
instructions](https://github.com/apache/metron/tree/master/metron-analytics/metron-profiler#deploying-profiles-with-the-stellar-shell).
1. Wait a few minutes for the Profiler to gather data and flush.
1. In the Stellar REPL, run `PROFILE_GET` to retrieve the data from HBase.
Ensure that data can be retrieved.
1. Kerberize the cluster and run the Metron Service Check.
1. Update the Bro sensor stub at `/opt/sensor-stubs/bin/start-bro-stub` and
include `--security-protocol=SASL_PLAINTEXT` as an argument to the
`kafka-console-producer.sh` command. Then run the script in a terminal so that
data is flowing through Metron. See also [these
instructions](https://github.com/apache/metron/blob/master/metron-deployment/Kerberos-manual-setup.md#push-data) for more information.
1. Wait a few minutes for the Profiler to gather data and flush.
1. In the Stellar REPL, run `PROFILE_GET` to retrieve the data from HBase.
Ensure that new data is being written, post-kerberization.
## Pull Request Checklist
- [ ] Is there a JIRA ticket associated with this PR? If not, one needs to
be created at [Metron
Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
- [ ] Does your PR title start with METRON-XXXX where XXXX is the JIRA
number you are trying to resolve? Pay particular attention to the hyphen "-"
character.
- [ ] Has your PR been rebased against the latest commit within the target
branch (typically master)?
- [ ] Have you included steps to reproduce the behavior or problem that is
being changed or addressed?
- [ ] Have you included steps or a guide to how the change may be verified
and tested manually?
- [ ] Have you ensured that the full suite of tests and checks have been
executed in the root metron folder via:
- [ ] Have you written or updated unit tests and/or integration tests to
verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] Have you verified the basic functionality of the build by building
and running locally with Vagrant full-dev environment or the equivalent?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nickwallen/metron METRON-1266
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/metron/pull/809.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #809
----
commit e7718fdd842eb2e2ae879619dd897c5dc1f8ddf5
Author: Nick Allen <[email protected]>
Date: 2017-10-19T19:51:03Z
METRON-1266 Profiler - SASL Authentication Failed
----
> Profiler - SASL Authentication Failed
> -------------------------------------
>
> Key: METRON-1266
> URL: https://issues.apache.org/jira/browse/METRON-1266
> Project: Metron
> Issue Type: Bug
> Affects Versions: 0.4.1
> Reporter: Nick Allen
> Assignee: Nick Allen
> Fix For: Next + 1
>
>
> When running the Profiler on a cluster that has multiple nodes and is secured
> by Kerberos, it was observed that the HBaseBolt was unable to write to HBase.
> The Storm worker running the HBaseBolt logged the following exception. This
> does not occur all the time and does not occur in all environments.
> {code}
> 2017-10-19 14:51:00.146 o.a.h.h.i.AbstractRpcClient [ERROR] SASL
> authentication failed. The most likely cause is missing or invalid
> credentials. Consider 'kinit'.
> javax.security.sasl.SaslException: GSS initiate failed
> at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> ~[?:1.8.0_144]
> at
> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
> ~[stormjar.jar:?]
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:609)
> ~[stormjar.jar:?]
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:154)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:735)
> ~[stormjar.jar:?]
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:732)
> ~[stormjar.jar:?]
> at java.security.AccessController.doPrivileged(Native Method)
> ~[?:1.8.0_144]
> at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_144]
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> ~[stormjar.jar:?]
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:732)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:885)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:854)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1180)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32651)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:201)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:180)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:369)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:343)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
> [stormjar.jar:?]
> at
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
> [stormjar.jar:?]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_144]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_144]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
> Caused by: org.ietf.jgss.GSSException: No valid credentials provided
> (Mechanism level: Failed to find any Kerberos tgt)
> at
> sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
> ~[?:1.8.0_144]
> at
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
> ~[?:1.8.0_144]
> at
> sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
> ~[?:1.8.0_144]
> at
> sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
> ~[?:1.8.0_144]
> at
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
> ~[?:1.8.0_144]
> at
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
> ~[?:1.8.0_144]
> at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
> ~[?:1.8.0_144]
> ... 25 more
> {code}