Re: Issues 16 and Issue 12

2015-08-03 Thread Darin Johnson
Swapnil,

Looked over both docs, HA and NM restart. They're pretty high level, so I'll
look forward to the details. Initial thoughts:

1. Getting framework reconciliation going would likely eliminate certain
issues, such as sendFrameworkMessage being unreliable, so it should be
implemented sooner rather than later.

2. How stable is the RMStateStore API? If it changes between versions
of Hadoop, it might be best to use Mesos's State API.

3. There was no mention of running two RMs as in traditional Hadoop RM HA
(maybe even in Marathon), but this should be considered a possibility. That
may have been implicit.

Saw the PR; will look at it.

Darin



[jira] [Updated] (MYRIAD-22) Support Mesos Framework Authentication

2015-08-03 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MYRIAD-22?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MYRIAD-22:
-
Fix Version/s: (was: March HackWeek)

 Support Mesos Framework Authentication
 --

 Key: MYRIAD-22
 URL: https://issues.apache.org/jira/browse/MYRIAD-22
 Project: Myriad
  Issue Type: Bug
Reporter: Adam B

 See the [Mesos protobuf for 
 Credential|https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L820]
 Also see Marathon's issue/implementation: 
 https://github.com/mesosphere/marathon/issues/638



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MYRIAD-14) Support fine grained scaling.

2015-08-03 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MYRIAD-14?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MYRIAD-14:
-
Fix Version/s: (was: March HackWeek)

 Support fine grained scaling.
 ---

 Key: MYRIAD-14
 URL: https://issues.apache.org/jira/browse/MYRIAD-14
 Project: Myriad
  Issue Type: Improvement
Reporter: Santosh Marella
Assignee: Santosh Marella

 Currently Myriad supports scaling at the NodeManager level, i.e., it launches a
 NodeManager instance for every resource offer it receives.
 We would like to support scaling at YARN's container level, i.e., launch a
 YARN container per Mesos resource offer.

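As a rough illustration of container-level scaling, the sketch below computes how many YARN containers of one fixed size a single Mesos offer could back; the scarcer of CPU and memory decides. The class and method names are hypothetical (not Myriad code), and the fixed container size is an assumption made only for the example.

```java
/**
 * Hypothetical helper (not part of Myriad): how many YARN containers of a
 * fixed size fit in one Mesos offer.
 */
final class OfferFit {
    private OfferFit() {}

    static int containersThatFit(double offerCpus, long offerMemMb,
                                 double containerCpus, long containerMemMb) {
        if (containerCpus <= 0 || containerMemMb <= 0) {
            throw new IllegalArgumentException("container size must be positive");
        }
        long byCpu = (long) Math.floor(offerCpus / containerCpus);
        long byMem = offerMemMb / containerMemMb;
        // The scarcer resource limits how many containers the offer can back.
        return (int) Math.min(byCpu, byMem);
    }
}
```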




[jira] [Reopened] (MYRIAD-47) Support custom users in Mesos

2015-08-03 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MYRIAD-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B reopened MYRIAD-47:
--

Reopening to close as duplicate

 Support custom users in Mesos
 -

 Key: MYRIAD-47
 URL: https://issues.apache.org/jira/browse/MYRIAD-47
 Project: Myriad
  Issue Type: Bug
Reporter: Adam B
 Fix For: March HackWeek


 The YAML already specifies which user to start each NM as, but the TaskFactory
 sets the CommandInfo user to root, and the Scheduler sets the FrameworkInfo
 user to the current user. These should be configurable too, and should not
 force root.

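A minimal sketch of the requested behavior, assuming the YAML profile may or may not name a user: use the configured user if present, otherwise fall back to the user the framework runs as, and refuse root unless it is explicitly allowed. `TaskUserResolver` and its parameters are hypothetical and do not exist in Myriad.

```java
import java.util.Optional;

/** Hypothetical sketch: choose the OS user for an NM task instead of forcing root. */
final class TaskUserResolver {
    private TaskUserResolver() {}

    /**
     * @param configuredUser user from the YAML profile, if any
     * @param frameworkUser  user the framework itself runs as (fallback)
     * @param allowRoot      whether an explicit or inherited "root" is permitted
     */
    static String resolve(Optional<String> configuredUser, String frameworkUser, boolean allowRoot) {
        String user = configuredUser
                .map(String::trim)
                .filter(u -> !u.isEmpty())   // blank config counts as "not set"
                .orElse(frameworkUser);
        if ("root".equals(user) && !allowRoot) {
            throw new IllegalArgumentException("refusing to launch NodeManager as root");
        }
        return user;
    }
}
```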




[jira] [Updated] (MYRIAD-13) High Availability Mode for Myriad

2015-08-03 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MYRIAD-13?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MYRIAD-13:
-
Fix Version/s: (was: March HackWeek)

 High Availability Mode for Myriad
 -

 Key: MYRIAD-13
 URL: https://issues.apache.org/jira/browse/MYRIAD-13
 Project: Myriad
  Issue Type: Improvement
Reporter: Mohit Soni

 When recovering from a failure, either a ResourceManager/Myriad JVM failure
 (new process) or a driver crash (same process), Myriad should be able to:
 1. reconstruct its existing state during recovery
 2. reconcile the TaskStatus of non-terminal tasks
 To achieve 1, Myriad needs to persist its state externally so that the state
 outlives a Myriad process run. State can be stored either in ZooKeeper or in the
 replicated log abstraction provided by Mesos. (Issue MYRIAD-15)
 To achieve 2, Myriad needs to leverage the reconciliation feature. Ben Mahler's
 [document|https://gist.github.com/bmahler/18409fc4f052df43f403] on
 reconciliation discusses an algorithm which frameworks can use to reconcile
 tasks. This should be implemented and used until reconciliation is managed by
 the SchedulerDriver itself. (Issue MYRIAD-16)

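To make point 1 concrete, here is a sketch of persisting a minimal scheduler state (the Mesos framework ID plus the IDs of non-terminal NM tasks) so that a restarted process can read it back. It writes a local file, one field per line, and replaces it via an atomic rename; this format, the local filesystem, and the `SchedulerStateFile` name are all assumptions for illustration. A real store would live in ZooKeeper, the Mesos replicated log, or a DFS as described above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: scheduler state that outlives the process (local file only). */
final class SchedulerStateFile {
    final String frameworkId;
    final List<String> activeTaskIds;

    SchedulerStateFile(String frameworkId, List<String> activeTaskIds) {
        this.frameworkId = frameworkId;
        this.activeTaskIds = List.copyOf(activeTaskIds);
    }

    /** Write to a temp file first, then rename, so readers never see a half-written state. */
    void save(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(frameworkId);
        lines.addAll(activeTaskIds);
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, lines, StandardCharsets.UTF_8);
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
    }

    /** First line is the framework ID; remaining lines are non-terminal task IDs. */
    static SchedulerStateFile load(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        if (lines.isEmpty()) {
            throw new IOException("empty state file: " + file);
        }
        return new SchedulerStateFile(lines.get(0), lines.subList(1, lines.size()));
    }
}
```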




[jira] [Updated] (MYRIAD-16) Reconcile tasks when recovering from failure

2015-08-03 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MYRIAD-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MYRIAD-16:
-
Fix Version/s: (was: March HackWeek)

 Reconcile tasks when recovering from failure
 

 Key: MYRIAD-16
 URL: https://issues.apache.org/jira/browse/MYRIAD-16
 Project: Myriad
  Issue Type: Improvement
Reporter: Mohit Soni
Assignee: Mohit Soni

 After recovering from a failure, reconcile the TaskStatus of non-terminal
 tasks.
 Notes:
 Ben Mahler's reconciliation
 [document|https://gist.github.com/bmahler/18409fc4f052df43f403] recommends
 an algorithm to be used by frameworks for reconciliation.

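The algorithm in that document can be sketched as a small state machine: explicitly reconcile all non-terminal tasks, retry with truncated exponential backoff for tasks that have not sent a status update since reconciliation started, and finish with an implicit reconciliation (an empty task list). `TaskReconciler` and its timing parameters are hypothetical; in a real scheduler each returned list would be turned into TaskStatus messages for `SchedulerDriver.reconcileTasks`, and `statusUpdate` callbacks would feed `onStatusUpdate`.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeSet;

/** Hypothetical sketch of explicit-then-implicit task reconciliation. */
final class TaskReconciler {
    private final TreeSet<String> remaining = new TreeSet<>();
    private final Map<String, Long> lastUpdateMs = new HashMap<>();
    private final long startMs;
    private final long maxBackoffMs;
    private long backoffMs;
    private long nextAttemptMs;
    private boolean implicitSent = false;

    TaskReconciler(Collection<String> nonTerminalTaskIds, long initialBackoffMs,
                   long maxBackoffMs, long nowMs) {
        remaining.addAll(nonTerminalTaskIds);
        this.startMs = nowMs;
        this.backoffMs = initialBackoffMs;
        this.maxBackoffMs = maxBackoffMs;
        this.nextAttemptMs = nowMs; // first explicit request is due immediately
    }

    /** Record when a status update for a task arrived. */
    void onStatusUpdate(String taskId, long arrivalMs) {
        lastUpdateMs.put(taskId, arrivalMs);
    }

    /**
     * Returns task IDs to reconcile explicitly, Optional.empty() if nothing is due,
     * or an empty list meaning "send an implicit reconciliation request".
     */
    Optional<List<String>> poll(long nowMs) {
        if (implicitSent || nowMs < nextAttemptMs) {
            return Optional.empty();
        }
        // Keep only tasks with no status update since reconciliation started.
        remaining.removeIf(id -> lastUpdateMs.getOrDefault(id, Long.MIN_VALUE) >= startMs);
        if (remaining.isEmpty()) {
            implicitSent = true;
            return Optional.of(List.of());
        }
        List<String> batch = new ArrayList<>(remaining);
        // Truncated exponential backoff before the next explicit attempt.
        nextAttemptMs = nowMs + backoffMs;
        backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
        return Optional.of(batch);
    }
}
```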




[jira] [Updated] (MYRIAD-15) SchedulerState store for Myriad

2015-08-03 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MYRIAD-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MYRIAD-15:
-
Fix Version/s: (was: March HackWeek)

 SchedulerState store for Myriad
 ---

 Key: MYRIAD-15
 URL: https://issues.apache.org/jira/browse/MYRIAD-15
 Project: Myriad
  Issue Type: Improvement
Reporter: Mohit Soni

 Implement state store for Myriad. 
 Details TBD.





[jira] [Reopened] (MYRIAD-46) Configurable Mesos Role

2015-08-03 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MYRIAD-46?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B reopened MYRIAD-46:
--

 Configurable Mesos Role
 ---

 Key: MYRIAD-46
 URL: https://issues.apache.org/jira/browse/MYRIAD-46
 Project: Myriad
  Issue Type: Bug
Reporter: Adam B
 Fix For: March HackWeek


 Enables support for resource reservations, as well as weighted fair sharing.
 Just register with a FrameworkInfo using a non-default role (not "*"), and
 you can then be offered resources reserved just for that role.

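For illustration: a framework registered with a non-default role is offered both resources reserved for that role and unreserved ("*") resources. The sketch below models offered resources as plain records and totals what a framework in a given role could use; `OfferedResource` and `RoleFilter` are hypothetical stand-ins for the Mesos `Resource` protobuf, not its API.

```java
import java.util.List;

/** Hypothetical model: resource name ("cpus", "mem"), amount, and reserved role ("*" = unreserved). */
record OfferedResource(String name, double amount, String role) {}

final class RoleFilter {
    static final String DEFAULT_ROLE = "*";

    private RoleFilter() {}

    /** Total of a resource reserved for exactly this role. */
    static double reservedFor(List<OfferedResource> offer, String name, String role) {
        return offer.stream()
                .filter(r -> r.name().equals(name) && r.role().equals(role))
                .mapToDouble(OfferedResource::amount)
                .sum();
    }

    /** Reserved plus unreserved amount usable by a framework registered with this role. */
    static double usableBy(List<OfferedResource> offer, String name, String role) {
        double unreserved = DEFAULT_ROLE.equals(role) ? 0.0 : reservedFor(offer, name, DEFAULT_ROLE);
        return reservedFor(offer, name, role) + unreserved;
    }
}
```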




[jira] [Reopened] (MYRIAD-37) Distribute executor/NM binaries rather than assume they already exist on-node.

2015-08-03 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MYRIAD-37?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B reopened MYRIAD-37:
--

 Distribute executor/NM binaries rather than assume they already exist on-node.
 --

 Key: MYRIAD-37
 URL: https://issues.apache.org/jira/browse/MYRIAD-37
 Project: Myriad
  Issue Type: Bug
Reporter: Adam B
 Fix For: March HackWeek


 In a cluster using the HDFS framework, or without HDFS, the YARN/Hadoop 
 binaries may not already be present on the node. In this case, the 
 executor/NM binaries could be distributed via HDFS, S3, etc.
 Dockerization is handled in a separate issue.

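One way to do this with Mesos is to list the Hadoop tarball as a URI in the task's CommandInfo, so the Mesos fetcher downloads and extracts it into the sandbox, and then start the NM from the extracted directory. The sketch below only derives that launch command. It assumes the archive's top-level directory matches the file name minus its suffix; the class name and example URIs are hypothetical.

```java
import java.net.URI;

/** Hypothetical sketch: NM launch command for a fetched and extracted Hadoop tarball. */
final class NmLaunchCommand {
    private static final String[] ARCHIVE_SUFFIXES = {".tar.gz", ".tgz", ".tar"};

    private NmLaunchCommand() {}

    /** Assumes the archive unpacks into a directory named after the file, minus its suffix. */
    static String extractedDirName(URI archiveUri) {
        String path = archiveUri.getPath();
        if (path == null) {
            throw new IllegalArgumentException("URI has no path: " + archiveUri);
        }
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        for (String suffix : ARCHIVE_SUFFIXES) {
            if (fileName.endsWith(suffix)) {
                return fileName.substring(0, fileName.length() - suffix.length());
            }
        }
        throw new IllegalArgumentException("not a recognized archive: " + archiveUri);
    }

    /** Shell command run in the sandbox after the fetcher has extracted the archive. */
    static String command(URI archiveUri) {
        return "cd " + extractedDirName(archiveUri) + " && bin/yarn nodemanager";
    }
}
```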




[jira] [Resolved] (MYRIAD-46) Configurable Mesos Role

2015-08-03 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MYRIAD-46?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B resolved MYRIAD-46.
--
Resolution: Fixed
  Assignee: Shingo Omura

 Configurable Mesos Role
 ---

 Key: MYRIAD-46
 URL: https://issues.apache.org/jira/browse/MYRIAD-46
 Project: Myriad
  Issue Type: Bug
Reporter: Adam B
Assignee: Shingo Omura
 Fix For: March HackWeek


 Enables support for resource reservations, as well as weighted fair sharing.
 Just register with a FrameworkInfo using a non-default role (not "*"), and
 you can then be offered resources reserved just for that role.





Myriad HA state store implementation

2015-08-03 Thread Swapnil Daingade
Hi all,

I have sent a pull request for the Myriad HA state store implementation here:
https://github.com/mesos/myriad/pull/123
* I want to mention that the design for Myriad HA has been strongly
influenced by Paul Read's
work here https://github.com/pdread100/myriad/tree/issue-13 and here
https://github.com/pdread100/myriad/tree/issue-15

* I have used Paul's code to serialize and deserialize the scheduler state
to and from the state store (see commit 1).
  I have made minor additions so that the frameworkId is stored and
retrieved as well.
  I have made sure to commit this code with Paul as the author.

* The pull request stores the Myriad scheduler state in the DFS.
* To use the state store implementation, add the following properties to
  yarn-site.xml on the RM:

  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.MyriadFileSystemRMStateStore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.fs.state-store.uri</name>
    <!-- Replace with the desired path -->
    <value>/var/mapr/cluster/yarn/rm/system</value>
  </property>

You should be able to see a directory structure similar to this on the DFS:

hadoop fs -ls /var/mapr/cluster/yarn/rm/system/FSRMStateRoot

Found 4 items
drwxr-xr-x   - mapr mapr  5 2015-07-23 13:40 /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMAppRoot
drwxr-xr-x   - mapr mapr 65 2015-08-02 17:19 /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMDTSecretManagerRoot
drwxr-xr-x   - mapr mapr  1 2015-07-27 17:21 /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMMyriadRoot    <-- Myriad state root folder
-rwxr-xr-x   3 mapr mapr  4 2015-07-21 10:37 /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMVersionNode

hadoop fs -ls /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMMyriadRoot

Found 1 items
-rwxr-xr-x   3 mapr mapr 80 2015-07-27 17:21 /var/mapr/cluster/yarn/rm/system/FSRMStateRoot/RMMyriadRoot/MyriadState    <-- Myriad state file

* This pull request does not yet do the following (work in progress):
  1. Reconcile state with the Mesos master and restart NMs if they are lost
     during a Myriad scheduler restart.
  2. In the case of FGS, update the RM's view of NM resources for NMs running
     containers.
* Detailed design doc for Myriad HA is here
https://docs.google.com/document/d/1BkcDChhOLU5TDU6ZQEpIh-WBKoCwYPPi9OV-__mQlmQ/edit?usp=sharing

Please let me know your thoughts, suggestions, etc.

Regards
Swapnil


Myriad HA design doc

2015-08-03 Thread Swapnil Daingade
Hi All,

I am posting a draft of the Myriad HA design doc using Google Docs. You
should be able to use the link below to add your comments.

https://docs.google.com/document/d/1BkcDChhOLU5TDU6ZQEpIh-WBKoCwYPPi9OV-__mQlmQ/edit?usp=sharing

Please let me know your thoughts and suggestions.
I should be able to send a pull request for storing and retrieving Myriad
state from the state store shortly.

Regards
Swapnil


Re: Jira ready (mesos/myriad github issues imported)

2015-08-03 Thread Adam Bordelon
I just disabled our GitHub issues link and added a link to the new JIRA at
the top of the GitHub page.
Please use JIRA from now on for any new issues or updates to existing
issues.
You probably need to create an account if you don't already have one.
Let us know if you need to be added to the Developers group to assign an
issue to yourself.
All committers have JIRA Admin access.

Thanks!

On Sun, Aug 2, 2015 at 6:23 AM, Gavin McDonald gmcdon...@apache.org wrote:


 Hi All,

 The mesos/myriad GitHub issues have been imported into the ASF JIRA, as per
 https://issues.apache.org/jira/browse/INFRA-9516.

 Please stop using GitHub issues and start using JIRA:

 https://issues.apache.org/jira/browse/MYRIAD

 Please comment on the INFRA-9516 ticket if you spot any problems;
 otherwise, please confirm all is OK and close the ticket.

 Thanks

 Gav…




Re: Issues 16 and Issue 12

2015-08-03 Thread Swapnil Daingade
Hi Darin,

The Myriad HA work will involve work related to issue 16.
I have already posted the Myriad HA design doc for review.
Your feedback on it would be really helpful.
I also plan to send out parts of the Myriad HA implementation for review
(although it does not address task reconciliation yet). I was planning to
work on that next.

Regards
Swapnil


On Mon, Aug 3, 2015 at 12:08 PM, Darin Johnson dbjohnson1...@gmail.com
wrote:

 Is anyone actively working on these? I'm interested in both of these and
 should have some cycles to work on them.

 One question I have on issue 12 is how to generalize scheduling policies
 if we have autoscaling, fine-grained scaling, and fixed resources (with a
 flexup/flexdown option). Currently it seems as though FGS is embedded
 pretty deeply. Ideally, though, we could have a SchedulerPolicy interface,
 and users could specify the SchedulerPolicy via the Myriad config.

 If I don't get a response, I'll probably start issue 16, as it's
 straightforward, and write something up on 12.

 Darin
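Darin's SchedulerPolicy idea could look roughly like the sketch below: each policy decides what to do with an offer, and the Myriad config names which policy to load. All names here (`SchedulerPolicy`, `OfferAction`, the `fixed`/`fgs` config values) are hypothetical rather than existing Myriad code. An autoscaling policy would be a third implementation of the same interface.

```java
import java.util.Set;

/** What to do with a Mesos offer (hypothetical). */
enum OfferAction { LAUNCH_NODE_MANAGER, ADD_TO_NODE_MANAGER, DECLINE }

/** Hypothetical pluggable policy, selected by name from the Myriad config. */
interface SchedulerPolicy {
    OfferAction onOffer(String host, Set<String> hostsWithNodeManagers);
}

/** Fixed resources: keep a target number of NodeManagers, decline everything else. */
final class FixedSizePolicy implements SchedulerPolicy {
    private final int targetNodeManagers;

    FixedSizePolicy(int targetNodeManagers) {
        this.targetNodeManagers = targetNodeManagers;
    }

    @Override
    public OfferAction onOffer(String host, Set<String> hostsWithNodeManagers) {
        boolean needMore = hostsWithNodeManagers.size() < targetNodeManagers;
        return needMore && !hostsWithNodeManagers.contains(host)
                ? OfferAction.LAUNCH_NODE_MANAGER
                : OfferAction.DECLINE;
    }
}

/** Fine-grained scaling: hand offered resources to an NM already on that host. */
final class FineGrainedPolicy implements SchedulerPolicy {
    @Override
    public OfferAction onOffer(String host, Set<String> hostsWithNodeManagers) {
        return hostsWithNodeManagers.contains(host)
                ? OfferAction.ADD_TO_NODE_MANAGER
                : OfferAction.DECLINE;
    }
}

final class SchedulerPolicies {
    private SchedulerPolicies() {}

    /** Map a config value such as "fixed" or "fgs" to a policy instance. */
    static SchedulerPolicy fromConfig(String name, int targetNodeManagers) {
        switch (name) {
            case "fixed":
                return new FixedSizePolicy(targetNodeManagers);
            case "fgs":
                return new FineGrainedPolicy();
            default:
                throw new IllegalArgumentException("unknown scheduler policy: " + name);
        }
    }
}
```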