[jira] [Created] (FLINK-18744) resume from modified savepoint dirctionary: No such file or directory
tao wang created FLINK-18744:

Summary: resume from modified savepoint dirctionary: No such file or directory
Key: FLINK-18744
URL: https://issues.apache.org/jira/browse/FLINK-18744
Project: Flink
Issue Type: Bug
Components: API / State Processor
Affects Versions: 1.11.1
Reporter: tao wang

If I resume a job from a savepoint that was modified with the State Processor API (for example, loaded from /savepoint-path-old and written to /savepoint-path-new), the job is resumed with savepointpath = /savepoint-path-new but throws an exception: /savepoint-path-new/{some-ui-id} (No such file or directory).

I think this happens because Flink 1.11 uses absolute paths in savepoints and checkpoints, and the State Processor API does not account for this.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-6341) JobManager can go to definite message sending loop when TaskManager registered
Tao Wang created FLINK-6341:
---

Summary: JobManager can go to definite message sending loop when TaskManager registered
Key: FLINK-6341
URL: https://issues.apache.org/jira/browse/FLINK-6341
Project: Flink
Issue Type: Bug
Components: JobManager
Reporter: Tao Wang
Assignee: Tao Wang

When a TaskManager registers with the JobManager, the JM sends a "NotifyResourceStarted" message to kick off the resource manager, then triggers a reconnection to the resource manager by sending a "TriggerRegistrationAtJobManager" message. When the JobManager's reference to the resource manager is not None and the reconnection targets the same resource manager, the JobManager enters an infinite message-sending loop in which it sends itself a "ReconnectResourceManager" message every 2 seconds.

We have already observed this phenomenon. For more details, check how JobManager handles `ReconnectResourceManager`.

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6312) Update curator version to 2.12.0
Tao Wang created FLINK-6312:
---

Summary: Update curator version to 2.12.0
Key: FLINK-6312
URL: https://issues.apache.org/jira/browse/FLINK-6312
Project: Flink
Issue Type: Bug
Components: JobManager
Reporter: Tao Wang
Assignee: Tao Wang

The Curator release used by Flink has a major bug, so we need to update to 2.12.0 to avoid potential blocking in Flink. (Flink uses Curator recipes in the checkpoint coordinator, and we already hit a problem during ZooKeeper failover while trying to fix https://issues.apache.org/jira/browse/FLINK-6174.)
[jira] [Created] (FLINK-6295) use LoadingCache instead of WeakHashMap to lower latency
Tao Wang created FLINK-6295:
---

Summary: use LoadingCache instead of WeakHashMap to lower latency
Key: FLINK-6295
URL: https://issues.apache.org/jira/browse/FLINK-6295
Project: Flink
Issue Type: Bug
Components: Webfrontend
Reporter: Tao Wang

ExecutionGraphHolder, which is used by many handlers, currently caches ExecutionGraphs in a WeakHashMap, whose entries are evicted only by garbage collection. When the JVM rarely runs GC, entries stay stale for a long time, so the reported status of jobs and their tasks can diverge from the real status.

LoadingCache is a commonly used cache implementation from the Guava library; we can use its time-based eviction to lower the latency of status updates.
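
The difference between GC-driven and time-based eviction can be sketched in a few lines. This is a language-neutral illustration written in Python, not Guava's LoadingCache and not Flink code; the class name `TimedLoadingCache`, its `ttl_seconds` parameter, and the injectable `clock` are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

V = TypeVar("V")


class TimedLoadingCache(Generic[V]):
    """Loads values on demand and reloads them once they are ttl_seconds old."""

    def __init__(
        self,
        loader: Callable[[Hashable], V],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so that expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, V]] = {}

    def get(self, key: Hashable) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # entry is still fresh
        value = self._loader(key)  # missing or expired: load again
        self._entries[key] = (now, value)
        return value
```

With a cache like this, the staleness of a cached job status is bounded by the configured time-to-live rather than by how often the garbage collector happens to run.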
[jira] [Created] (FLINK-6192) reuse zookeeer client created by CuratorFramework
Tao Wang created FLINK-6192:
---

Summary: reuse zookeeer client created by CuratorFramework
Key: FLINK-6192
URL: https://issues.apache.org/jira/browse/FLINK-6192
Project: Flink
Issue Type: Improvement
Components: JobManager, YARN
Reporter: Tao Wang
Assignee: Tao Wang

In YARN mode, three places in the ApplicationMaster/JobManager use a ZooKeeper client (web monitor, JobManager, and ResourceManager), and two places in the TaskManager do. Each of them creates a new ZooKeeper client whenever it needs one.

I believe other places do the same thing, but within one JVM a single CuratorFramework is enough for connections to one ZooKeeper ensemble, so we need a singleton to reuse it.
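
A minimal sketch of the proposed sharing, assuming clients can be keyed by their connect string. It is written in Python for brevity and does not use the Curator API; `SharedClients`, `factory`, and `connect_string` are hypothetical names for illustration only.

```python
import threading
from typing import Callable, Dict, Generic, TypeVar

C = TypeVar("C")


class SharedClients(Generic[C]):
    """Hands out one client per connect string within this process."""

    def __init__(self, factory: Callable[[str], C]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._clients: Dict[str, C] = {}

    def get(self, connect_string: str) -> C:
        # The lock ensures that two callers racing on the same connect string
        # end up with the same client instead of creating one each.
        with self._lock:
            client = self._clients.get(connect_string)
            if client is None:
                client = self._factory(connect_string)
                self._clients[connect_string] = client
            return client
```

Components such as the web monitor and the JobManager would then ask the shared registry for a client instead of constructing their own. Closing a shared client safely needs reference counting or a single owner, which this sketch leaves out.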
[jira] [Created] (FLINK-6189) Do not use yarn client config to do sanity check
Tao Wang created FLINK-6189:
---

Summary: Do not use yarn client config to do sanity check
Key: FLINK-6189
URL: https://issues.apache.org/jira/browse/FLINK-6189
Project: Flink
Issue Type: Bug
Components: YARN
Reporter: Tao Wang

Currently the client rejects a submission if the number of slots is greater than the value of "yarn.nodemanager.resource.cpu-vcores" in the YARN client config.

This makes no sense, because the actual vcores of a node manager are decided on the cluster side, not on the client side. If the option is unset or set to a wrong value (it is not mandatory), that should not affect Flink job submission.
[jira] [Created] (FLINK-6174) Introduce a leader election service in yarn mode to make JobManager always available
Tao Wang created FLINK-6174:
---

Summary: Introduce a leader election service in yarn mode to make JobManager always available
Key: FLINK-6174
URL: https://issues.apache.org/jira/browse/FLINK-6174
Project: Flink
Issue Type: Improvement
Components: JobManager
Reporter: Tao Wang
Assignee: Tao Wang

In YARN mode, if ZooKeeper is used for high availability, Flink creates an election service that picks a leader through ZooKeeper election. When the ZooKeeper leader crashes or the connection between the JobManager and the ZooKeeper instance breaks, the JobManager's leadership is revoked and it sends a Disconnect message to the TaskManagers, which cancels all running tasks and makes them wait until the connection between JM and ZK is rebuilt.

In YARN mode there is exactly one JobManager (AM) at any time, and it should always be the leader instead of being elected through ZooKeeper. We can introduce a new leader election service in YARN mode to achieve that.
[jira] [Created] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly
Tao Wang created FLINK-6020:
---

Summary: Blob Server cannot hanlde multiple job sumits(with same content) parallelly
Key: FLINK-6020
URL: https://issues.apache.org/jira/browse/FLINK-6020
Project: Flink
Issue Type: Bug
Reporter: Tao Wang
Priority: Critical

In yarn-cluster mode, if we submit the same job multiple times in parallel, the tasks run into class loading problems and lease occupation. The blob server stores user jars under names generated from their SHA-1 sums: it first writes a temp file and then moves it to its final name. For recovery it also puts them on HDFS under the same file name. When multiple clients submit the same job with the same jar at the same time, the local jar files in the blob server and the files on HDFS are handled by multiple threads (BlobServerConnection) that interfere with each other.

It would be better to have a way to handle this. Two ideas come to mind:
1. lock the write operation, or
2. use some unique identifier as the file name instead of (or in addition to) the SHA-1 sum of the file contents.
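
One way to picture the second idea is to give every writer its own temporary file and only then move it to the content-addressed name. The sketch below is a variation on the ideas above for a local filesystem only, written in Python rather than taken from the blob server; `store_blob` and `storage_dir` are hypothetical names, and it assumes the temp file and the final file live on the same filesystem so that the rename is atomic. It says nothing about the HDFS side of the problem.

```python
import hashlib
import os
import uuid


def store_blob(storage_dir: str, content: bytes) -> str:
    """Stores content under its SHA-1 name without exposing partial writes."""
    os.makedirs(storage_dir, exist_ok=True)
    digest = hashlib.sha1(content).hexdigest()
    final_path = os.path.join(storage_dir, digest)
    # Each writer gets its own temp file, so concurrent uploads never share one.
    temp_path = os.path.join(storage_dir, f"{digest}.{uuid.uuid4().hex}.tmp")
    with open(temp_path, "wb") as f:
        f.write(content)
    # os.replace swaps the file in atomically; since all writers carry
    # identical bytes, it does not matter which of them wins.
    os.replace(temp_path, final_path)
    return final_path
```

Because the final name is still derived from the contents, readers keep finding the jar under the same key, while concurrent writers no longer touch each other's in-progress files.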
[jira] [Created] (FLINK-5981) SSL version and ciper suites cannot be constrained as configured
Tao Wang created FLINK-5981:
---

Summary: SSL version and ciper suites cannot be constrained as configured
Key: FLINK-5981
URL: https://issues.apache.org/jira/browse/FLINK-5981
Project: Flink
Issue Type: Bug
Components: Security
Reporter: Tao Wang
Assignee: Tao Wang

I configured SSL and started a Flink job, but found that the configured properties are not applied properly:

akka port: only the cipher suites are applied correctly; the SSL version is not
blob server/netty server: neither the SSL version nor the cipher suites match what I configured

I've found the reasons:
http://stackoverflow.com/questions/11504173/sslcontext-initialization (for blob server and netty server)
https://groups.google.com/forum/#!topic/akka-user/JH6bGnWE8kY (for the akka SSL version; it's fixed in akka 2.4: https://github.com/akka/akka/pull/21078)

I'll fix the issue for the blob server and netty server. On the akka side it seems that only an akka upgrade can solve the issue (we'll consider that later, as an upgrade is not a small change).
[jira] [Created] (FLINK-5916) make env.java.opts.jobmanager and env.java.opts.taskmanager working in YARN mode
Tao Wang created FLINK-5916:
---

Summary: make env.java.opts.jobmanager and env.java.opts.taskmanager working in YARN mode
Key: FLINK-5916
URL: https://issues.apache.org/jira/browse/FLINK-5916
Project: Flink
Issue Type: Improvement
Components: YARN
Reporter: Tao Wang
Assignee: Tao Wang

Currently only env.java.opts works in YARN mode, and it applies to both the JM and the TM. I'd like to make env.java.opts.jobmanager and env.java.opts.taskmanager work in YARN mode as well, to support fine-grained parameter settings.
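
The requested behaviour could look roughly like the following sketch, which combines the shared option with the role-specific one. It is a Python illustration, not Flink's launcher code; the function name `jvm_opts` and the ordering (shared options first, role-specific options appended) are assumptions made for the example.

```python
from typing import Dict


def jvm_opts(conf: Dict[str, str], role: str) -> str:
    """Joins the shared env.java.opts with the role-specific env.java.opts.<role>."""
    shared = conf.get("env.java.opts", "")
    specific = conf.get(f"env.java.opts.{role}", "")
    # Assumed ordering: shared options first, role-specific options last.
    return " ".join(part for part in (shared, specific) if part)


conf = {
    "env.java.opts": "-XX:+UseG1GC",
    "env.java.opts.taskmanager": "-Xss1m",
}
taskmanager_opts = jvm_opts(conf, "taskmanager")  # "-XX:+UseG1GC -Xss1m"
jobmanager_opts = jvm_opts(conf, "jobmanager")    # "-XX:+UseG1GC"
```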
[jira] [Created] (FLINK-5904) jobmanager.heap.mb and taskmanager.heap.mb not work in YARN mode
Tao Wang created FLINK-5904:
---

Summary: jobmanager.heap.mb and taskmanager.heap.mb not work in YARN mode
Key: FLINK-5904
URL: https://issues.apache.org/jira/browse/FLINK-5904
Project: Flink
Issue Type: Bug
Reporter: Tao Wang
Attachments: screenshot-1.png

I set taskmanager.heap.mb to 5120 and jobmanager.heap.mb to 2048 and ran ./yarn-session.sh -n 3, but YARN allocated only 4 GB for this application.
[jira] [Created] (FLINK-5903) taskmanager.numberOfTaskSlots and yarn.containers.vcores did not work well in YARN mode
Tao Wang created FLINK-5903:
---

Summary: taskmanager.numberOfTaskSlots and yarn.containers.vcores did not work well in YARN mode
Key: FLINK-5903
URL: https://issues.apache.org/jira/browse/FLINK-5903
Project: Flink
Issue Type: Bug
Components: YARN
Reporter: Tao Wang

Currently Flink does not respect taskmanager.numberOfTaskSlots and yarn.containers.vcores in flink-conf.yaml, only the -s parameter of the CLI.

In detail, taskmanager.numberOfTaskSlots has no effect at all, and yarn.containers.vcores is used only when requesting container (TM) resources but is not passed on to the TM, which means the TM always assumes the default (1) slot if -s is not configured.
[jira] [Created] (FLINK-5902) Some images can not show in IE
Tao Wang created FLINK-5902:
---

Summary: Some images can not show in IE
Key: FLINK-5902
URL: https://issues.apache.org/jira/browse/FLINK-5902
Project: Flink
Issue Type: Bug
Components: Webfrontend
Environment: IE
Reporter: Tao Wang

Some images on the Overview page are not shown in IE, while they display fine in Chrome. I'm using IE 11, but I think the same applies to IE9.

I'll paste a screenshot later.
[jira] [Created] (FLINK-5901) DAG can not show properly in IE
Tao Wang created FLINK-5901:
---

Summary: DAG can not show properly in IE
Key: FLINK-5901
URL: https://issues.apache.org/jira/browse/FLINK-5901
Project: Flink
Issue Type: Bug
Components: Webfrontend
Environment: IE 11
Reporter: Tao Wang

The DAG of running jobs is not shown properly in IE 11 (I am using 11.0.9600.18059, but I assume the same applies to IE9). The task description is not shown within the rectangle. Chrome works fine.
[jira] [Created] (FLINK-5825) In yarn mode, a small pic can not be loaded
Tao Wang created FLINK-5825:
---

Summary: In yarn mode, a small pic can not be loaded
Key: FLINK-5825
URL: https://issues.apache.org/jira/browse/FLINK-5825
Project: Flink
Issue Type: Bug
Components: Webfrontend, YARN
Reporter: Tao Wang
Priority: Minor

In YARN mode, the web frontend is accessed through YARN at a URL like "http://spark-91-206:8088/proxy/application_1487122678902_0015/", and the running job page's URL is "http://spark-91-206:8088/proxy/application_1487122678902_0015/#/jobs/9440a129ea5899c16e7c1a7e8f2897b3".

A very small .png file called "horizontal.png" cannot be loaded in that mode, because "index.styl" references it by absolute path. We should make the path relative.
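
Standard URL resolution shows why the absolute reference breaks behind the YARN proxy. The snippet uses Python's `urllib.parse.urljoin` with the proxy URL from the report; the `images/` directory is a hypothetical location, since the report does not say where "horizontal.png" lives.

```python
from urllib.parse import urljoin

# Base URL of the web frontend as served through the YARN proxy (from the report).
PROXY_BASE = "http://spark-91-206:8088/proxy/application_1487122678902_0015/"

# An absolute path discards the proxy prefix and points at the YARN host root.
absolute = urljoin(PROXY_BASE, "/images/horizontal.png")

# A relative path is resolved underneath the proxied application.
relative = urljoin(PROXY_BASE, "images/horizontal.png")

print(absolute)  # http://spark-91-206:8088/images/horizontal.png
print(relative)  # http://spark-91-206:8088/proxy/application_1487122678902_0015/images/horizontal.png
```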
[jira] [Created] (FLINK-5818) change checkpoint dir permission to 700 for security reason
Tao Wang created FLINK-5818:
---

Summary: change checkpoint dir permission to 700 for security reason
Key: FLINK-5818
URL: https://issues.apache.org/jira/browse/FLINK-5818
Project: Flink
Issue Type: Improvement
Components: Security, State Backends, Checkpointing
Reporter: Tao Wang

Currently the checkpoint directory is created without explicit permissions, so another user can easily delete or read files under it, which can cause restore failures or information leaks. It would be better to restrict it to 700.
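
For a local POSIX filesystem, restricting a directory to mode 700 can be sketched as follows. This is a Python illustration of the permission change only, not Flink's checkpoint code; `make_private_dir` is a hypothetical helper, and HDFS or other filesystems would need their own permission API.

```python
import os
import stat


def make_private_dir(path: str) -> None:
    """Creates path if needed and restricts it to its owner (rwx------)."""
    os.makedirs(path, exist_ok=True)
    # chmod explicitly so that a pre-existing directory is tightened too and
    # the result does not depend on the process umask.
    os.chmod(path, 0o700)


def mode_of(path: str) -> int:
    """Returns only the permission bits of path, e.g. 0o700."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

Note that only the leaf directory is tightened here; parent directories created along the way keep their default permissions.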
[jira] [Created] (FLINK-5729) add hostname option in SocketWindowWordCount example to be more convenient
Tao Wang created FLINK-5729:
---

Summary: add hostname option in SocketWindowWordCount example to be more convenient
Key: FLINK-5729
URL: https://issues.apache.org/jira/browse/FLINK-5729
Project: Flink
Issue Type: Improvement
Components: Examples
Reporter: Tao Wang
Priority: Minor

A "hostname" option will help users get data from the right port; otherwise the example fails with "connection refused".
[jira] [Created] (FLINK-5723) Use "Used" instead of "Initial" to make taskmanager tag more readable
Tao Wang created FLINK-5723:
---

Summary: Use "Used" instead of "Initial" to make taskmanager tag more readable
Key: FLINK-5723
URL: https://issues.apache.org/jira/browse/FLINK-5723
Project: Flink
Issue Type: Improvement
Components: Webfrontend
Reporter: Tao Wang
Priority: Trivial

In the JobManager web frontend, the used memory of task managers is currently shown as "Initial" in the table header, although according to the code it actually means "memory used". I'd like to change it to be more readable, even though it is a trivial change.
[jira] [Created] (FLINK-5417) Fix the wrong config file name
Tao Wang created FLINK-5417:
---

Summary: Fix the wrong config file name
Key: FLINK-5417
URL: https://issues.apache.org/jira/browse/FLINK-5417
Project: Flink
Issue Type: Bug
Components: Documentation
Reporter: Tao Wang
Priority: Trivial

Since the config file name is conf/flink-conf.yaml, the usage "conf/flink-config.yaml" in the documentation is wrong and can easily confuse users. We should correct it.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)