[ https://issues.apache.org/jira/browse/SINGA-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anh Dinh updated SINGA-11:
--------------------------
    Summary: Start SINGA on Apache Mesos  (was: Start SINGA using Mesos)

> Start SINGA on Apache Mesos
> ---------------------------
>
>                 Key: SINGA-11
>                 URL: https://issues.apache.org/jira/browse/SINGA-11
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: wangwei
>            Assignee: Anh Dinh
>
> Apache Mesos is a fine-grained cluster management framework that enables
> resource sharing within a single cluster. Mesos abstracts away the physical
> configuration of cluster nodes and presents resources to users in the form
> of "offers". SINGA uses Mesos for two purposes:
> 1. To acquire necessary resources for training the model.
> 2. To launch and monitor progress of the training task.
> To this end, we implement a "SINGA Scheduler" that interacts with the Mesos
> master. The scheduler assumes that SINGA has been installed on the Mesos
> slave nodes. The scheduler is invoked when the user wants to start a new
> SINGA job, and it performs the following steps:
> Step 1. Read the job configuration file to determine the necessary resources
>         in terms of CPUs, memory and storage.
> Step 2. Wait for resource offers from the Mesos master.
> Step 3. Determine whether the offers meet the resource requirements.
> Step 4. Prepare the task to launch at each slave:
>           + Deliver the job configuration file to the slave node.
>           + Specify the command to run on the slave:
>             "singa -conf ./job.conf"
> Step 5. Launch the task and monitor its progress.
> For step 3, we currently implement a simple scheme: the number of CPUs
> offered by each Mesos slave must exceed the total number of SINGA workers
> and SINGA servers per process. In other words, each Mesos slave must be
> able to run an entire worker group or server group.
> For step 4, we currently rely on HDFS to deliver the configuration file to
> each slave. Specifically, we write the file to a known directory (different
> for each job) on HDFS and ask the slave to use its Fetcher utility to
> download the file before executing the task.
> We will create a README.md file explaining the steps.
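
The offer check in step 3 and the task preparation in step 4 could be
sketched roughly as follows. This is only an illustration, not the actual
scheduler code: the Offer class, the function names, the one-CPU-per-worker-
or-server assumption, and the example HDFS path are all hypothetical, and
"exceed" is read here as "is at least". The dictionary returned by
task_description only stands in for whatever task structure the scheduler
hands to Mesos.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Hypothetical stand-in for a Mesos resource offer (not the Mesos API)."""
    slave_id: str
    cpus: float


def usable_offers(offers, workers_per_process, servers_per_process):
    """Step 3: keep offers whose CPUs cover one whole SINGA process.

    Assumption: each worker or server in a process needs one CPU.
    """
    needed = workers_per_process + servers_per_process
    return [offer for offer in offers if offer.cpus >= needed]


def task_description(offer, conf_uri):
    """Step 4: describe the task to launch on the slave behind an offer.

    conf_uri is the location of the job configuration file on HDFS; the
    slave is expected to fetch it into the task's working directory
    before the command runs.
    """
    return {
        "slave_id": offer.slave_id,
        "uris": [conf_uri],                   # files to fetch before launch
        "command": "singa -conf ./job.conf",  # command from the issue text
    }


if __name__ == "__main__":
    offers = [Offer("slave-1", 4), Offer("slave-2", 1)]
    for offer in usable_offers(offers, workers_per_process=2, servers_per_process=1):
        # The HDFS path below is a made-up example.
        print(task_description(offer, "hdfs://namenode/singa/job-1/job.conf"))
```

Offers that fail the check are simply skipped in this sketch; a real
scheduler would presumably decline them so that Mesos can offer the
resources to other frameworks.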



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
