[jira] [Created] (FLINK-8066) Changed configuration of taskmanagers should recreate them

2017-11-13  Stephen Gran (JIRA)
Stephen Gran created FLINK-8066:
---

 Summary: Changed configuration of taskmanagers should recreate them
 Key: FLINK-8066
 URL: https://issues.apache.org/jira/browse/FLINK-8066
 Project: Flink
  Issue Type: New Feature
Reporter: Stephen Gran
Priority: Minor


When we redeploy the jobmanager to our Mesos cluster with changed parameters 
affecting the taskmanagers (e.g. changing from 1 CPU per TM to 2 CPUs per TM), the 
existing taskmanagers are reused rather than replaced with taskmanagers launched 
with the new parameters.

It seems like `recoverWorkers` in 
`org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager` has 
most of the information it would need to perform this convergence check, and it 
doesn't look like a large amount of work to add it.
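A rough sketch of the check I have in mind (every class and method name below is 
invented purely for illustration; this is not the actual `MesosFlinkResourceManager` 
API):

```java
// Illustrative only: invented types showing the convergence check on recovered workers.
import java.util.Arrays;
import java.util.List;

public class WorkerConvergenceSketch {

    /** Hypothetical stand-in for the parameters a task manager was launched with. */
    static final class WorkerSpec {
        final double cpus;
        final int memoryMb;

        WorkerSpec(double cpus, int memoryMb) {
            this.cpus = cpus;
            this.memoryMb = memoryMb;
        }

        boolean matches(WorkerSpec other) {
            return cpus == other.cpus && memoryMb == other.memoryMb;
        }
    }

    public static void main(String[] args) {
        // What the redeployed job manager now wants for each task manager.
        WorkerSpec current = new WorkerSpec(2.0, 4096);

        // Workers recovered from before the redeploy, launched with the old parameters.
        List<WorkerSpec> recovered = Arrays.asList(
                new WorkerSpec(1.0, 4096),
                new WorkerSpec(2.0, 4096));

        for (WorkerSpec worker : recovered) {
            if (worker.matches(current)) {
                System.out.println("matches current config: reuse the task manager");
            } else {
                System.out.println("stale config: release it and request a replacement");
            }
        }
    }
}
```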

My concern with starting to work on the issue there is that there may be a 
higher level, perhaps in `FlinkResourceManager`, that should perform this work 
for both Mesos and YARN.  The two implementations look quite different, however, 
so this may be an over-eager optimisation best left for later.  I'm happy to 
look at a patch for this, but I wanted some input before starting the work to 
see where you think this should live.





[jira] [Created] (FLINK-6408) Repeated loading of configuration files in hadoop filesystem code paths

2017-04-28  Stephen Gran (JIRA)
Stephen Gran created FLINK-6408:
---

 Summary: Repeated loading of configuration files in hadoop 
filesystem code paths
 Key: FLINK-6408
 URL: https://issues.apache.org/jira/browse/FLINK-6408
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Stephen Gran
Priority: Minor


We are running Flink on Mesos in AWS.  Checkpointing is enabled with an S3 
backend, configured via the Hadoop s3a filesystem implementation, and checkpoints 
are taken every second.

We are seeing roughly 3 million log events per hour from a relatively small 
job, and it appears that this is because every S3 copy operation reloads the Hadoop 
configuration, which in turn reloads the Flink configuration.  The Flink 
configuration loader logs each key/value pair every time it is invoked, which 
accounts for this volume of logs.

While the logging is relatively easy to deal with - just a Log4j setting - the 
behaviour is probably suboptimal.  It seems that the configuration loader could 
easily be changed over to a singleton pattern to prevent the constant re-reading 
of files.
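
A minimal sketch of what we mean (the wrapper class name is invented for this 
example; only `GlobalConfiguration.loadConfiguration()` is the existing Flink entry 
point):

```java
// Sketch of a cached/singleton configuration loader; CachedFlinkConfiguration is a
// hypothetical name, not an existing Flink class.
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.GlobalConfiguration;

public final class CachedFlinkConfiguration {

    private static volatile Configuration cached;

    private CachedFlinkConfiguration() {}

    /** Loads flink-conf.yaml at most once per JVM and returns the cached copy afterwards. */
    public static Configuration get() {
        Configuration local = cached;
        if (local == null) {
            synchronized (CachedFlinkConfiguration.class) {
                local = cached;
                if (local == null) {
                    // Only this first call reads the files and logs each key/value pair.
                    local = GlobalConfiguration.loadConfiguration();
                    cached = local;
                }
            }
        }
        return local;
    }
}
```

The Hadoop filesystem code paths would then call something like 
`CachedFlinkConfiguration.get()` rather than re-reading the files on every S3 
operation.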

If you're interested, we can probably knock up a patch for this in a relatively 
short time.

Cheers,





[jira] [Created] (FLINK-6336) Placement Constraints for Mesos

2017-04-20  Stephen Gran (JIRA)
Stephen Gran created FLINK-6336:
---

 Summary: Placement Constraints for Mesos
 Key: FLINK-6336
 URL: https://issues.apache.org/jira/browse/FLINK-6336
 Project: Flink
  Issue Type: New Feature
  Components: Mesos
Affects Versions: 1.2.0
Reporter: Stephen Gran
Priority: Minor


Fenzo supports placement constraints for tasks, and Mesos operators can expose 
agent attributes to frameworks as part of each agent's offer.

It would be extremely helpful in our multi-tenant cluster to be able to make 
use of this facility.
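
To make the request concrete, this is the kind of attribute check we would like to 
be able to configure, sketched in plain Java with invented names (this is not 
Fenzo's actual API):

```java
// Illustrative only: a hypothetical attribute-equality constraint of the kind Fenzo's
// placement constraints express; class and method names are invented for this sketch.
import java.util.HashMap;
import java.util.Map;

public class PlacementConstraintSketch {

    /** True if the agent's offered attributes satisfy the required attribute/value pair. */
    static boolean satisfies(Map<String, String> agentAttributes, String name, String requiredValue) {
        return requiredValue.equals(agentAttributes.get(name));
    }

    public static void main(String[] args) {
        // Attributes an operator might set on a Mesos agent, e.g. --attributes=rack:r1;tenant:teamA
        Map<String, String> agentAttributes = new HashMap<>();
        agentAttributes.put("rack", "r1");
        agentAttributes.put("tenant", "teamA");

        // A configurable constraint such as "tenant:teamA" would keep our task managers
        // on agents dedicated to our tenant in the shared cluster.
        System.out.println(satisfies(agentAttributes, "tenant", "teamA")); // true
        System.out.println(satisfies(agentAttributes, "tenant", "teamB")); // false
    }
}
```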


