[ 
https://issues.apache.org/jira/browse/SINGA-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966270#comment-14966270
 ] 

ASF subversion and git services commented on SINGA-11:
------------------------------------------------------

Commit 5d076e526b8d78478ed969ee0c4c33febc8b9eee in incubator-singa's branch 
refs/heads/master from [~flytosky]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=5d076e5 ]

SINGA-11 Start SINGA on Apache Mesos

Merge branch 'feature-mesos'


> Start SINGA on Apache Mesos
> ---------------------------
>
>                 Key: SINGA-11
>                 URL: https://issues.apache.org/jira/browse/SINGA-11
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: wangwei
>            Assignee: Anh Dinh
>
> Apache Mesos is a fine-grained cluster management framework which enables 
> resource sharing in the same cluster. Mesos abstracts out the physical 
> configurations of cluster nodes, and presents resources to the users in the 
> form of "offers". SINGA uses Mesos for two purposes:
> # To acquire necessary resources for training the model.
> # To launch and monitor progress of the training task.
> To these ends, we implement a {{SINGA Scheduler}} which interacts with the 
> Mesos master. The scheduler is called when the user wants to start a new 
> SINGA job, and it performs the following steps:
> # Read the job configuration file to determine necessary resources in terms 
> of CPUs, memory and storage.
> # Wait for resource offers from the Mesos master.
> # Determine if the offers meet the requirement of resources.
> # Prepare the task to launch at each slave:
> #* Deliver the job configuration file to the slave node.
> #* Specify the command to run on the slave:
> {code}
> singa -conf ./job.conf
> {code}
> #* Launch the task and monitor its progress.
> For step 3, we currently implement a simple scheme: a slave is selected only 
> if the number of CPUs it offers exceeds the total number of SINGA workers and 
> SINGA servers per process. In other words, each selected slave must be able 
> to run the entire worker group or server group.
> For step 4, we currently rely on HDFS to deliver the configuration file to 
> each slave. Specifically, we write the file to a known directory (different 
> for each job) on HDFS and ask the slave to use its Fetcher utility to 
> download the file before executing the task.
> The development and testing environment for this ticket is created from 
> [SINGA-89|https://issues.apache.org/jira/browse/SINGA-89].
> We will create a {{README.md}} file explaining the steps.
> h5. Important
> We assume that SINGA, Mesos and Hadoop are running on every node.
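
The offer check in step 3 and the task preparation in step 4 of the quoted description can be sketched roughly as follows. This is only an illustration, not the scheduler's actual code: the `Offer` class, the function and parameter names, the dictionary layout and the HDFS path are all hypothetical, and no Mesos API is called. The description says the offered CPUs must "exceed" the per-process total; the sketch reads that as "at least", which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Hypothetical, simplified stand-in for a Mesos resource offer."""
    slave_id: str
    cpus: float


def select_offers(offers, workers_per_process, servers_per_process):
    """Step 3: keep offers with enough CPUs for one process's workers and servers.

    Assumption: "enough" means at least the per-process total (>=).
    """
    required = workers_per_process + servers_per_process
    return [offer for offer in offers if offer.cpus >= required]


def make_task(offer, conf_uri):
    """Step 4: describe the task for one slave as a plain dict.

    This is not a Mesos TaskInfo message, only an illustration of its contents.
    """
    return {
        "slave_id": offer.slave_id,
        # File the slave is asked to download before the task runs.
        "uris": [conf_uri],
        # Command taken from the ticket description.
        "command": "singa -conf ./job.conf",
    }


offers = [Offer("slave-1", 4), Offer("slave-2", 1)]
selected = select_offers(offers, workers_per_process=2, servers_per_process=1)
# Hypothetical per-job HDFS location of the configuration file.
tasks = [make_task(o, "hdfs://namenode/singa/job-0/job.conf") for o in selected]
```

With these example values only `slave-1` is selected, since it offers 4 CPUs against a requirement of 3, while `slave-2` offers 1.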



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
