Ethan Li created FLINK-9684:
-------------------------------
Summary: HistoryServerArchiveFetcher not working properly with
secure hdfs cluster
Key: FLINK-9684
URL: https://issues.apache.org/jira/browse/FLINK-9684
Project: Flink
Issue Type: Bug
Affects Versions: 1.4.2
Reporter: Ethan Li
With my current setup, the JobManager and TaskManager are able to talk to the
Kerberos-secured HDFS cluster. However, running the history server produces:
{code:java}
2018-06-27 19:03:32,080 WARN org.apache.hadoop.ipc.Client - Exception
encountered while connecting to the server :
java.lang.IllegalArgumentException: Failed to specify server's Kerberos
principal name
2018-06-27 19:03:32,085 ERROR
org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher -
Failed to access job archive location for path
hdfs://openqe11blue-n2.blue.ygrid.yahoo.com/tmp/flink/openstorm10-blue/jmarchive.
java.io.IOException: Failed on local exception: java.io.IOException:
java.lang.IllegalArgumentException: Failed to specify server's Kerberos
principal name; Host Details : local host is:
"openstorm10blue-n2.blue.ygrid.yahoo.com/10.215.79.35"; destination host is:
"openqe11blue-n2.blue.ygrid.yahoo.com":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1414)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy9.getListing(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at com.sun.proxy.$Proxy9.getListing(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:515)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1743)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1726)
at
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:650)
at
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
at
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
at
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
at
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.listStatus(HadoopFileSystem.java:146)
at
org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher$JobArchiveFetcherTask.run(HistoryServerArchiveFetcher.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.IllegalArgumentException: Failed to
specify server's Kerberos principal name
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:677)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at
org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:640)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:724)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
... 28 more
{code}
After changing the log level to DEBUG, the history server logs show:
{code:java}
2018-06-27 19:03:30,931 INFO
org.apache.flink.runtime.webmonitor.history.HistoryServer - Enabling SSL for
the history server.
2018-06-27 19:03:30,931 DEBUG org.apache.flink.runtime.net.SSLUtils - Creating
server SSL context from configuration
2018-06-27 19:03:31,091 DEBUG org.apache.flink.core.fs.FileSystem - Loading
extension file systems via services
2018-06-27 19:03:31,094 DEBUG org.apache.flink.core.fs.FileSystem - Added file
system maprfs:org.apache.flink.runtime.fs.maprfs.MapRFsFactory
2018-06-27 19:03:31,102 DEBUG org.apache.flink.runtime.util.HadoopUtils -
Cannot find hdfs-default configuration-file path in Flink config.
2018-06-27 19:03:31,102 DEBUG org.apache.flink.runtime.util.HadoopUtils -
Cannot find hdfs-site configuration-file path in Flink config.
2018-06-27 19:03:31,102 DEBUG org.apache.flink.runtime.util.HadoopUtils - Could
not find Hadoop configuration via any of the supported methods (Flink
configuration, environment variables).
2018-06-27 19:03:31,178 DEBUG org.apache.flink.runtime.fs.hdfs.HadoopFsFactory
- Instantiating for file system scheme hdfs Hadoop File System
org.apache.hadoop.hdfs.DistributedFileSystem
2018-06-27 19:03:31,829 INFO
org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher -
Monitoring directory
hdfs://openqe11blue-n2.blue.ygrid.yahoo.com/tmp/flink/openstorm10-blue/jmarchive
for archived jobs.
{code}
The root cause is this line in HistoryServer.java:
https://github.com/apache/flink/blob/release-1.4.2/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/history/HistoryServer.java#L169
{code:java}
FileSystem refreshFS = refreshPath.getFileSystem();
{code}
Here getFileSystem() is called before
{code:java}
FileSystem.initialize(xxx){code}
has ever been called.
As a result, getFileSystem() falls into the lazy-initialization branch at
[https://github.com/apache/flink/blob/release-1.4.2/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L388-L390]
{code:java}
if (FS_FACTORIES.isEmpty()) {
initialize(new Configuration());
}
{code}
and because that configuration is empty, the Hadoop settings from the Flink
config are never picked up, so it cannot connect to the secure HDFS cluster.
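The ordering problem can be sketched in isolation. The following is a simplified, self-contained model and not Flink code: the Registry class, its methods, and the /etc/hadoop/conf path are hypothetical stand-ins for the "initialize with an empty configuration if nothing is registered yet" fallback quoted above.

```java
import java.util.HashMap;
import java.util.Map;

public class LazyInitOrderingSketch {

    /** Hypothetical registry that mimics a lazy "initialize if empty" fallback. */
    static final class Registry {
        // scheme -> Hadoop configuration directory the factory was set up with
        private static final Map<String, String> FACTORIES = new HashMap<>();

        /** Registers the hdfs factory using whatever the given config contains. */
        static void initialize(Map<String, String> config) {
            FACTORIES.clear();
            FACTORIES.put("hdfs", config.getOrDefault("fs.hdfs.hadoopconf", "<none>"));
        }

        /** Looks up a scheme; falls back to an EMPTY config if nothing is registered. */
        static String hadoopConfDirFor(String scheme) {
            if (FACTORIES.isEmpty()) {
                initialize(new HashMap<>());
            }
            return FACTORIES.get(scheme);
        }

        /** Clears the registry so both orderings can be shown in one run. */
        static void reset() {
            FACTORIES.clear();
        }
    }

    public static void main(String[] args) {
        Map<String, String> loaded = new HashMap<>();
        loaded.put("fs.hdfs.hadoopconf", "/etc/hadoop/conf"); // assumed path

        // Problematic order: lookup happens before any explicit initialize().
        Registry.reset();
        System.out.println("lookup before initialize: " + Registry.hadoopConfDirFor("hdfs"));

        // Intended order: initialize with the loaded config, then look up.
        Registry.reset();
        Registry.initialize(loaded);
        System.out.println("lookup after initialize: " + Registry.hadoopConfDirFor("hdfs"));
    }
}
```

In this model the first lookup reports `<none>` because the fallback ran with an empty configuration, while the second reports the configured directory. By analogy, calling FileSystem.initialize() with the loaded Flink configuration before refreshPath.getFileSystem() would presumably avoid the empty-configuration fallback.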
A workaround is to set the HADOOP_CONF_DIR or HADOOP_HOME environment variable.
But we should still fix this, since Flink has the
{code:java}
fs.hdfs.hadoopconf
{code}
config option; otherwise it will be confusing to users.