[jira] [Created] (FLINK-18744) resume from modified savepoint dirctionary: No such file or directory

2020-07-28 Thread tao wang (Jira)
tao wang created FLINK-18744:


 Summary: resume from modified savepoint dirctionary: No such file 
or directory
 Key: FLINK-18744
 URL: https://issues.apache.org/jira/browse/FLINK-18744
 Project: Flink
  Issue Type: Bug
  Components: API / State Processor
Affects Versions: 1.11.1
Reporter: tao wang


If I resume a job from a savepoint that was modified with the State Processor 
API, for example one loaded from /savepoint-path-old and written to 
/savepoint-path-new, the job resumes with savepointPath = /savepoint-path-new 
but throws an exception: 
_*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_.
I think this happens because Flink 1.11 uses absolute paths in savepoints and 
checkpoints, but the State Processor API does not account for this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-6341) JobManager can go to definite message sending loop when TaskManager registered

2017-04-20 Thread Tao Wang (JIRA)
Tao Wang created FLINK-6341:
---

 Summary: JobManager can go to definite message sending loop when 
TaskManager registered
 Key: FLINK-6341
 URL: https://issues.apache.org/jira/browse/FLINK-6341
 Project: Flink
  Issue Type: Bug
  Components: JobManager
Reporter: Tao Wang
Assignee: Tao Wang


When a TaskManager registers with the JobManager, the JM sends a 
"NotifyResourceStarted" message to kick off the Resource Manager, and then 
triggers a reconnection to the resource manager by sending a 
"TriggerRegistrationAtJobManager" message.

When the JobManager's reference to the resource manager is not None and the 
reconnection targets the same resource manager, the JobManager enters an 
infinite message-sending loop in which it sends itself a 
"ReconnectResourceManager" message every 2 seconds.

We have already observed this phenomenon. For more details, check how the 
JobManager handles `ReconnectResourceManager`.





[jira] [Created] (FLINK-6312) Update curator version to 2.12.0

2017-04-17 Thread Tao Wang (JIRA)
Tao Wang created FLINK-6312:
---

 Summary: Update curator version to 2.12.0
 Key: FLINK-6312
 URL: https://issues.apache.org/jira/browse/FLINK-6312
 Project: Flink
  Issue Type: Bug
  Components: JobManager
Reporter: Tao Wang
Assignee: Tao Wang


Because there is a major bug in the Curator release used by Flink, we need to 
update it to 2.12.0 to avoid potential blocking in Flink. (Flink uses Curator 
recipes in the checkpoint coordinator, and we have already hit the problem 
during ZooKeeper failover while trying to fix 
https://issues.apache.org/jira/browse/FLINK-6174)





[jira] [Created] (FLINK-6295) use LoadingCache instead of WeakHashMap to lower latency

2017-04-11 Thread Tao Wang (JIRA)
Tao Wang created FLINK-6295:
---

 Summary: use LoadingCache instead of WeakHashMap to lower latency
 Key: FLINK-6295
 URL: https://issues.apache.org/jira/browse/FLINK-6295
 Project: Flink
  Issue Type: Bug
  Components: Webfrontend
Reporter: Tao Wang


Currently ExecutionGraphHolder, which is used by many handlers, caches 
ExecutionGraph(s) in a WeakHashMap, whose entries are evicted only by garbage 
collection.

When the JVM rarely runs GC, the latency is too high, so the displayed status 
of jobs or their tasks does not match the real one.

LoadingCache is a commonly used cache implementation from the Guava library; 
we can use its time-based eviction to lower the latency of status updates.
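
The time-based eviction idea can be sketched as follows. This is a minimal 
Python illustration, not Flink or Guava code; the class name, the loader 
callback and the `ttl_seconds` parameter are hypothetical. In Guava the 
comparable setting is `CacheBuilder.expireAfterWrite`.

```python
import time


class ExpiringLoadingCache:
    """Cache whose entries are reloaded once they are older than ttl_seconds.

    Hypothetical sketch: unlike a GC-driven weak map, staleness here is
    bounded by a fixed time limit.
    """

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # called with the key on a miss or expiry
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the behavior can be tested
        self._entries = {}         # key -> (load_time, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # still fresh: serve the cached value
        value = self._loader(key)  # missing or expired: load again
        self._entries[key] = (now, value)
        return value
```

With a short time limit, a handler would see a stale job status for at most 
that long, regardless of how often the JVM collects garbage.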





[jira] [Created] (FLINK-6192) reuse zookeeer client created by CuratorFramework

2017-03-27 Thread Tao Wang (JIRA)
Tao Wang created FLINK-6192:
---

 Summary: reuse zookeeer client created by CuratorFramework
 Key: FLINK-6192
 URL: https://issues.apache.org/jira/browse/FLINK-6192
 Project: Flink
  Issue Type: Improvement
  Components: JobManager, YARN
Reporter: Tao Wang
Assignee: Tao Wang


Currently in YARN mode, three places in the ApplicationMaster/JobManager use a 
ZooKeeper client (web monitor, JobManager and ResourceManager), and two places 
in the TaskManager do. Each creates a new ZooKeeper client when it needs one.

I believe more places do the same thing, but within one JVM a single 
CuratorFramework is enough for connections to one ZooKeeper cluster, so we 
need a singleton to reuse it.
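
A minimal Python sketch of the proposed reuse, assuming one shared client per 
connect string within a process. `FakeZkClient` and `get_shared_client` are 
hypothetical stand-ins and not part of Flink or Curator.

```python
import threading


class FakeZkClient:
    """Hypothetical stand-in for a real ZooKeeper/Curator client."""

    def __init__(self, connect_string):
        self.connect_string = connect_string


_clients = {}              # connect string -> shared client
_clients_lock = threading.Lock()


def get_shared_client(connect_string):
    """Return the process-wide client for connect_string, creating it once."""
    with _clients_lock:
        client = _clients.get(connect_string)
        if client is None:
            client = FakeZkClient(connect_string)
            _clients[connect_string] = client
        return client
```

Every component in the same process that asks for the same connect string 
gets the same instance, so only one connection is opened.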





[jira] [Created] (FLINK-6189) Do not use yarn client config to do sanity check

2017-03-25 Thread Tao Wang (JIRA)
Tao Wang created FLINK-6189:
---

 Summary: Do not use yarn client config to do sanity check
 Key: FLINK-6189
 URL: https://issues.apache.org/jira/browse/FLINK-6189
 Project: Flink
  Issue Type: Bug
  Components: YARN
Reporter: Tao Wang


Currently in the client, if #slots is greater than the value of 
"yarn.nodemanager.resource.cpu-vcores" in the YARN client config, the 
submission is rejected.

This makes no sense, because the actual vcores of a node manager are decided 
on the cluster side, not on the client side. If we don't set this config, or 
don't set it to the right value (indeed, this config is not mandatory), it 
should not affect Flink submission.





[jira] [Created] (FLINK-6174) Introduce a leader election service in yarn mode to make JobManager always available

2017-03-22 Thread Tao Wang (JIRA)
Tao Wang created FLINK-6174:
---

 Summary: Introduce a leader election service in yarn mode to make 
JobManager always available
 Key: FLINK-6174
 URL: https://issues.apache.org/jira/browse/FLINK-6174
 Project: Flink
  Issue Type: Improvement
  Components: JobManager
Reporter: Tao Wang
Assignee: Tao Wang


Currently in YARN mode, if we use ZooKeeper as the high-availability choice, 
Flink creates an election service that picks a leader based on ZooKeeper 
election.

When the ZooKeeper leader crashes or the connection between the JobManager and 
the ZooKeeper instance is broken, the JobManager's leadership is revoked and 
it sends a Disconnect message to the TaskManager, which cancels all running 
tasks and makes them wait for the connection between JM and ZK to be rebuilt.

In YARN mode there is one and only one JobManager (AM) at any time, and it 
should always be the leader instead of being elected through ZooKeeper. We can 
introduce a new leader election service in YARN mode to achieve that.





[jira] [Created] (FLINK-6020) Blob Server cannot hanlde multiple job sumits(with same content) parallelly

2017-03-10 Thread Tao Wang (JIRA)
Tao Wang created FLINK-6020:
---

 Summary: Blob Server cannot hanlde multiple job sumits(with same 
content) parallelly
 Key: FLINK-6020
 URL: https://issues.apache.org/jira/browse/FLINK-6020
 Project: Flink
  Issue Type: Bug
Reporter: Tao Wang
Priority: Critical


In yarn-cluster mode, if we submit the same job multiple times in parallel, 
the tasks encounter class loading problems and lease occupation.

The reason is that the blob server stores user jars under a name generated 
from their sha1sum: it first writes a temp file and then moves it to its final 
location. For recovery it also puts them on HDFS under the same file name.

So when multiple clients submit the same job with the same jar at the same 
time, the local jar files in the blob server and the files on HDFS are handled 
in multiple threads (BlobServerConnection) and interfere with each other.

It would be better to have a way to handle this. Two ideas come to mind:
1. lock the write operation, or
2. use some unique identifier as the file name instead of (or in addition to) 
the sha1sum of the file contents.
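
One way to picture a variant of idea 2 on a local file system: keep the 
sha1sum as the final name, but give each writer its own uniquely named 
temporary file and publish it with an atomic rename. This is a hedged Python 
sketch, not the blob server's code; `put_blob` and `put_blob_concurrently` 
are hypothetical names, and it does not cover the HDFS side.

```python
import hashlib
import os
import tempfile
import threading


def put_blob(storage_dir, content):
    """Store content under its sha1 name via a writer-private temp file."""
    key = hashlib.sha1(content).hexdigest()
    final_path = os.path.join(storage_dir, key)
    # mkstemp picks a unique name, so parallel writers never share a temp file.
    fd, tmp_path = tempfile.mkstemp(dir=storage_dir, prefix=key + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(content)
        # Atomic within one file system: readers see the old file or the new
        # one, never a partially written file.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def put_blob_concurrently(storage_dir, content, writers):
    """Simulate several clients submitting the same jar at the same time."""
    threads = [
        threading.Thread(target=put_blob, args=(storage_dir, content))
        for _ in range(writers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Because identical content always maps to the same final name, concurrent 
writers overwrite each other with identical bytes instead of corrupting a 
shared temp file.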





[jira] [Created] (FLINK-5981) SSL version and ciper suites cannot be constrained as configured

2017-03-07 Thread Tao Wang (JIRA)
Tao Wang created FLINK-5981:
---

 Summary: SSL version and ciper suites cannot be constrained as 
configured
 Key: FLINK-5981
 URL: https://issues.apache.org/jira/browse/FLINK-5981
 Project: Flink
  Issue Type: Bug
  Components: Security
Reporter: Tao Wang
Assignee: Tao Wang


I configured SSL and started a Flink job, but found that the configured 
properties are not applied properly:
akka port: only the cipher suites are applied correctly, the SSL version is not
blob server/netty server: neither the SSL version nor the cipher suites match 
what I configured

I've found the reasons:
http://stackoverflow.com/questions/11504173/sslcontext-initialization (for blob 
server and netty server)
https://groups.google.com/forum/#!topic/akka-user/JH6bGnWE8kY (for akka ssl 
version; it's fixed in akka 2.4: https://github.com/akka/akka/pull/21078)

I'll fix the issue in the blob server and netty server. It seems that only an 
akka upgrade can solve the issue on the akka side (we'll consider that later, 
as an upgrade is not a small action).





[jira] [Created] (FLINK-5916) make env.java.opts.jobmanager and env.java.opts.taskmanager working in YARN mode

2017-02-24 Thread Tao Wang (JIRA)
Tao Wang created FLINK-5916:
---

 Summary: make env.java.opts.jobmanager and 
env.java.opts.taskmanager working in YARN mode
 Key: FLINK-5916
 URL: https://issues.apache.org/jira/browse/FLINK-5916
 Project: Flink
  Issue Type: Improvement
  Components: YARN
Reporter: Tao Wang
Assignee: Tao Wang


Currently only env.java.opts works in YARN mode, and it applies to both JM and 
TM. I'd like to make env.java.opts.jobmanager and env.java.opts.taskmanager 
work in YARN mode as well, to support fine-grained parameter settings.
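
A small Python sketch of the intended behavior, under the assumption that the 
role-specific options are appended to the shared ones. `build_jvm_opts` is a 
hypothetical helper, and the JVM flags in the usage note are only 
placeholders.

```python
def build_jvm_opts(flink_conf, role):
    """Combine env.java.opts with the role-specific env.java.opts.<role>.

    flink_conf is a plain dict of configuration keys to string values;
    role is "jobmanager" or "taskmanager". Assumed behavior, not Flink code.
    """
    if role not in ("jobmanager", "taskmanager"):
        raise ValueError("unknown role: " + role)
    shared = flink_conf.get("env.java.opts", "").strip()
    specific = flink_conf.get("env.java.opts." + role, "").strip()
    # Shared options first, then the role-specific ones; skip empty parts.
    return " ".join(part for part in (shared, specific) if part)
```

For example, with env.java.opts set to `-XX:+UseG1GC` and 
env.java.opts.jobmanager set to `-Xloggc:jm.log`, the JM would get both 
options while the TM would get only the shared one.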





[jira] [Created] (FLINK-5904) jobmanager.heap.mb and taskmanager.heap.mb not work in YARN mode

2017-02-24 Thread Tao Wang (JIRA)
Tao Wang created FLINK-5904:
---

 Summary: jobmanager.heap.mb and taskmanager.heap.mb not work in 
YARN mode
 Key: FLINK-5904
 URL: https://issues.apache.org/jira/browse/FLINK-5904
 Project: Flink
  Issue Type: Bug
Reporter: Tao Wang
 Attachments: screenshot-1.png

I set taskmanager.heap.mb to 5120 and jobmanager.heap.mb to 2048 and ran 
./yarn-session.sh -n 3, but YARN only allocates 4GB for this application.





[jira] [Created] (FLINK-5903) taskmanager.numberOfTaskSlots and yarn.containers.vcores did not work well in YARN mode

2017-02-23 Thread Tao Wang (JIRA)
Tao Wang created FLINK-5903:
---

 Summary: taskmanager.numberOfTaskSlots and yarn.containers.vcores 
did not work well in YARN mode
 Key: FLINK-5903
 URL: https://issues.apache.org/jira/browse/FLINK-5903
 Project: Flink
  Issue Type: Bug
  Components: YARN
Reporter: Tao Wang


Currently Flink does not respect taskmanager.numberOfTaskSlots and 
yarn.containers.vcores in flink-conf.yaml, but only the -s parameter in the 
CLI.

In detail, taskmanager.numberOfTaskSlots does not work at all, and 
yarn.containers.vcores is only used when requesting container (TM) resources 
but is not made known to the TM, which means the TM always thinks it has the 
default (1) slot if -s is not configured.
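
The precedence this report seems to expect can be sketched in Python as 
follows. This is an assumption about the desired behavior, not Flink's actual 
code; `resolve_task_slots` is a hypothetical helper.

```python
DEFAULT_SLOTS = 1  # the default mentioned in the report


def resolve_task_slots(cli_slots, flink_conf):
    """Pick the slot count: CLI -s first, then flink-conf.yaml, then default.

    cli_slots is the value of -s, or None if the flag was not given;
    flink_conf is a dict of configuration keys to values.
    """
    if cli_slots is not None:
        return cli_slots
    return int(flink_conf.get("taskmanager.numberOfTaskSlots", DEFAULT_SLOTS))
```

With this order, a value in flink-conf.yaml is honored whenever -s is absent, 
instead of silently falling back to 1.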





[jira] [Created] (FLINK-5902) Some images can not show in IE

2017-02-23 Thread Tao Wang (JIRA)
Tao Wang created FLINK-5902:
---

 Summary: Some images can not show in IE
 Key: FLINK-5902
 URL: https://issues.apache.org/jira/browse/FLINK-5902
 Project: Flink
  Issue Type: Bug
  Components: Webfrontend
 Environment: IE
Reporter: Tao Wang


Some images on the Overview page do not show in IE, while they display fine in 
Chrome.

I'm using IE 11, but I think it is the same with IE9. I'll paste the 
screenshot later.





[jira] [Created] (FLINK-5901) DAG can not show properly in IE

2017-02-23 Thread Tao Wang (JIRA)
Tao Wang created FLINK-5901:
---

 Summary: DAG can not show properly in IE
 Key: FLINK-5901
 URL: https://issues.apache.org/jira/browse/FLINK-5901
 Project: Flink
  Issue Type: Bug
  Components: Webfrontend
 Environment: IE 11
Reporter: Tao Wang


The DAG of running jobs does not show properly in IE11 (I am using 
11.0.9600.18059, but I assume it is the same with IE9). The description of a 
task is not shown within the rectangle.

It displays fine in Chrome.





[jira] [Created] (FLINK-5825) In yarn mode, a small pic can not be loaded

2017-02-16 Thread Tao Wang (JIRA)
Tao Wang created FLINK-5825:
---

 Summary: In yarn mode, a small pic can not be loaded
 Key: FLINK-5825
 URL: https://issues.apache.org/jira/browse/FLINK-5825
 Project: Flink
  Issue Type: Bug
  Components: Webfrontend, YARN
Reporter: Tao Wang
Priority: Minor


In YARN mode, the web frontend URL is accessed through YARN in a format like 
"http://spark-91-206:8088/proxy/application_1487122678902_0015/", and the 
running job page's URL is 
"http://spark-91-206:8088/proxy/application_1487122678902_0015/#/jobs/9440a129ea5899c16e7c1a7e8f2897b3".

A very small .png file called "horizontal.png" cannot be loaded in that mode, 
because "index.styl" references it by an absolute path.

We should make the path relative.
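
The difference between the two kinds of reference can be shown with Python's 
`urllib.parse.urljoin`, which resolves references the way a browser does. The 
`images/` directory is an assumption made for illustration; only the file 
name "horizontal.png" comes from this report.

```python
from urllib.parse import urljoin

# Page URL as served through the YARN proxy (taken from the report).
PROXY_BASE = "http://spark-91-206:8088/proxy/application_1487122678902_0015/"

# An absolute path is resolved against the host root, so the proxy prefix is lost.
absolute_ref = urljoin(PROXY_BASE, "/images/horizontal.png")

# A relative path is resolved against the page URL, so the prefix is kept.
relative_ref = urljoin(PROXY_BASE, "images/horizontal.png")

print(absolute_ref)
print(relative_ref)
```

The first result points at the YARN web server root, which does not serve the 
image, while the second stays under the proxied application path.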





[jira] [Created] (FLINK-5818) change checkpoint dir permission to 700 for security reason

2017-02-16 Thread Tao Wang (JIRA)
Tao Wang created FLINK-5818:
---

 Summary: change checkpoint dir permission to 700 for security 
reason
 Key: FLINK-5818
 URL: https://issues.apache.org/jira/browse/FLINK-5818
 Project: Flink
  Issue Type: Improvement
  Components: Security, State Backends, Checkpointing
Reporter: Tao Wang


Currently the checkpoint directory is created without an explicit permission, 
so it is easy for another user to delete or read files under it, which can 
cause restore failures or information leaks.

It's better to restrict it to 700.
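
On a local POSIX file system, the idea looks like the following Python sketch. 
It is only an illustration: `make_private_dir` is a hypothetical helper, and a 
real fix would have to go through the file system API that Flink uses for 
checkpoint storage.

```python
import os
import stat
import tempfile


def make_private_dir(path):
    """Create path (if needed) and restrict it to its owner: rwx------."""
    # The mode passed to makedirs is filtered by the process umask ...
    os.makedirs(path, mode=0o700, exist_ok=True)
    # ... so set it explicitly; this also tightens a directory that already existed.
    os.chmod(path, 0o700)
    return path


if __name__ == "__main__":
    demo = make_private_dir(os.path.join(tempfile.mkdtemp(), "checkpoints"))
    print(oct(stat.S_IMODE(os.stat(demo).st_mode)))
```

With mode 700, other users can neither list, read nor delete files under the 
directory.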





[jira] [Created] (FLINK-5729) add hostname option in SocketWindowWordCount example to be more convenient

2017-02-07 Thread Tao Wang (JIRA)
Tao Wang created FLINK-5729:
---

 Summary: add hostname option in SocketWindowWordCount example to 
be more convenient
 Key: FLINK-5729
 URL: https://issues.apache.org/jira/browse/FLINK-5729
 Project: Flink
  Issue Type: Improvement
  Components: Examples
Reporter: Tao Wang
Priority: Minor


"hostname" option will help users to get data from the right port, otherwise 
the example would fail due to connection refused.





[jira] [Created] (FLINK-5723) Use "Used" instead of "Initial" to make taskmanager tag more readable

2017-02-06 Thread Tao Wang (JIRA)
Tao Wang created FLINK-5723:
---

 Summary: Use "Used" instead of "Initial" to make taskmanager tag 
more readable
 Key: FLINK-5723
 URL: https://issues.apache.org/jira/browse/FLINK-5723
 Project: Flink
  Issue Type: Improvement
  Components: Webfrontend
Reporter: Tao Wang
Priority: Trivial


Currently in the JobManager web frontend, the used memory of task managers is 
presented as "Initial" in the table header, which, according to the code, 
actually means "memory used".

I'd like to change it to be more readable, even though it is a trivial change.





[jira] [Created] (FLINK-5417) Fix the wrong config file name

2017-01-05 Thread Tao Wang (JIRA)
Tao Wang created FLINK-5417:
---

 Summary: Fix the wrong config file name 
 Key: FLINK-5417
 URL: https://issues.apache.org/jira/browse/FLINK-5417
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Reporter: Tao Wang
Priority: Trivial


As the config file name is conf/flink-conf.yaml, the usage 
"conf/flink-config.yaml" in the documentation is wrong and likely to confuse 
users. We should correct it.


