[ 
https://issues.apache.org/jira/browse/SINGA-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962175#comment-14962175
 ] 

Anh Dinh commented on SINGA-11:
-------------------------------

Pull request created:

https://github.com/apache/incubator-singa/pull/85

This one depends on SINGA-89, whose pull request has been created.



> Start SINGA on Apache Mesos
> ---------------------------
>
>                 Key: SINGA-11
>                 URL: https://issues.apache.org/jira/browse/SINGA-11
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: wangwei
>            Assignee: Anh Dinh
>
> Apache Mesos is a fine-grained cluster management framework which enables 
> resource sharing in the same cluster. Mesos abstracts out the physical 
> configurations of cluster nodes, and presents resources to the users in the 
> form of "offers". SINGA uses Mesos for two purposes:
> # To acquire necessary resources for training the model.
> # To launch and monitor progress of the training task.
> To these ends, we implement a {{SINGA Scheduler}} which interacts with the 
> Mesos master. The scheduler is called when the user wants to start a new SINGA job, 
> and it performs the following steps:
> # Read the job configuration file to determine necessary resources in terms 
> of CPUs, memory and storage.
> # Wait for resource offers from the Mesos master.
> # Determine whether the offers meet the resource requirements.
> # Prepare the task to launch at each slave:
> #* Deliver the job configuration file to the slave node.
> #* Specify the command to run on the slave:
> {code}
> singa -conf ./job.conf
> {code}
> #* Launch the task and monitor its progress.
> For step 3, we currently implement a simple scheme: the number of CPUs 
> offered by each Mesos slave must exceed the total number of SINGA workers and 
> SINGA servers per process. In other words, each selected slave must be able to 
> run the entire worker group or server group.
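
The CPU check described for step 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the scheduler's actual code: the `Offer` class, its field names, and the `select_offers` helper are hypothetical, and the comparison is strict because the description says the offered CPUs must "exceed" the requirement (whether an exact match should also be accepted is not stated).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Offer:
    """Hypothetical, simplified stand-in for a Mesos resource offer."""
    slave_id: str
    cpus: float


def select_offers(offers: List[Offer],
                  workers_per_process: int,
                  servers_per_process: int) -> List[Offer]:
    """Keep only offers whose CPU count exceeds the per-process requirement.

    Assumption: one CPU per SINGA worker and one per SINGA server.
    """
    needed = workers_per_process + servers_per_process
    # Strict comparison, following the wording "exceed" in the description.
    return [offer for offer in offers if offer.cpus > needed]


offers = [Offer("slave-1", 4), Offer("slave-2", 8), Offer("slave-3", 2)]
usable = select_offers(offers, workers_per_process=2, servers_per_process=1)
print([offer.slave_id for offer in usable])
```

With 2 workers and 1 server per process, only slaves offering more than 3 CPUs are kept.
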
> For step 4, we currently rely on HDFS to deliver the configuration file to 
> each slave. Specifically, we write the file to a known directory (different 
> for each job) on HDFS and ask the slave to use its Fetcher utility to 
> download the file before executing the task.
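
As a rough sketch of the step 4 delivery scheme, the snippet below builds a per-job HDFS URI and pairs it with the command from the description. Everything here is an assumption made for illustration: the name node address, the `/singa` base directory, the job identifier, and the helper names are hypothetical, and the returned dictionary only mirrors the general shape of a Mesos command description (a list of URIs to fetch plus a shell command) rather than calling any real Mesos API.

```python
import posixpath


def job_conf_uri(namenode: str, base_dir: str, job_id: str,
                 filename: str = "job.conf") -> str:
    """Build the HDFS URI of a job's configuration file.

    Assumption: each job gets its own subdirectory under base_dir.
    """
    path = posixpath.join(base_dir, job_id, filename)
    return "hdfs://" + namenode + path


def task_command(conf_uri: str) -> dict:
    """Describe what the slave should fetch and run (plain dict, not a Mesos object)."""
    return {
        # The slave's Fetcher is expected to download these before the task starts.
        "uris": [{"value": conf_uri}],
        # Command quoted from the issue description; it reads the fetched file.
        "value": "singa -conf ./job.conf",
    }


uri = job_conf_uri("namenode:9000", "/singa", "job-42")
print(task_command(uri))
```

Using a distinct directory per job keeps configuration files from concurrent jobs apart, which matches the "different for each job" note in the description.
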
> The development and testing environment for this ticket is created from 
> [SINGA-89|https://issues.apache.org/jira/browse/SINGA-89].
> We will create a {{README.md}} file explaining the steps.
> h5. Important
> We assume that SINGA, Mesos and Hadoop are running on every node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
