[ 
https://issues.apache.org/jira/browse/SINGA-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anh Dinh updated SINGA-11:
--------------------------
    Description: 
Apache Mesos is a fine-grained cluster management framework which enables 
resource sharing in the same cluster. Mesos abstracts out the physical 
configurations of cluster nodes, and presents resources to the users in the 
form of "offers". SINGA uses Mesos for two purposes:
# To acquire necessary resources for training the model.
# To launch and monitor progress of the training task.

To these ends, we implement a {{SINGA Scheduler}} which interacts with the 
Mesos master. The scheduler is called when the user wants to start a new SINGA 
job, and it performs the following steps:
# Read the job configuration file to determine necessary resources in terms of 
CPUs, memory and storage.
# Wait for resource offers from the Mesos master.
# Determine whether the offers meet the resource requirements.
# Prepare the task to launch at each slave:
#* Deliver the job configuration file to the slave node.
#* Specify the command to run on the slave:
{code}
singa -conf ./job.conf
{code}
#* Launch and monitor the progress

For step 3, we currently implement a simple scheme: the number of CPUs offered 
by each Mesos slave must exceed the total number of SINGA workers and SINGA 
servers per process. In other words, each selected slave must be able to run 
the entire worker group or server group.
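
The check in step 3 can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the scheduler's actual code: the {{Offer}} class, the function name {{select_offers}} and the parameter names are hypothetical, and the strict "exceed" comparison follows the wording of this ticket.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Hypothetical stand-in for a Mesos resource offer (CPU count only)."""
    slave_id: str
    cpus: float


def select_offers(offers, workers_per_process, servers_per_process):
    """Keep only the offers whose CPU count exceeds the per-process total.

    Assumption: one CPU per SINGA worker or server, and a strict '>'
    comparison, as the description says the CPUs must "exceed" the total.
    """
    required = workers_per_process + servers_per_process
    return [offer for offer in offers if offer.cpus > required]


offers = [Offer("s1", 2), Offer("s2", 8), Offer("s3", 5)]
selected = select_offers(offers, workers_per_process=3, servers_per_process=1)
print([offer.slave_id for offer in selected])
```

With three workers and one server per process, only slaves offering more than four CPUs are kept.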

For step 4, we currently rely on HDFS to deliver the configuration file to 
each slave. Specifically, we write the file to a known directory (different for 
each job) on HDFS and ask the slave to use its Fetcher utility to download the 
file before executing the task.
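
As a rough sketch of step 4, the task handed to a slave could be described as follows. The dictionary only mirrors the general shape of a Mesos command with a list of URIs for the Fetcher; it is not built with the Mesos API. The function name {{build_task}}, the {{/singa}} base directory, the job ID and the namenode address are all assumptions made for illustration.

```python
import posixpath


def build_task(job_id, hdfs_namenode, base_dir="/singa"):
    """Describe a task whose job.conf is fetched from a per-job HDFS directory.

    Assumptions: the file was already written to <base_dir>/<job_id>/job.conf
    on HDFS, and the Fetcher places it in the task's working directory, so
    the relative path ./job.conf resolves when the command runs.
    """
    conf_path = posixpath.join(base_dir, job_id, "job.conf")
    conf_uri = "hdfs://" + hdfs_namenode + conf_path
    return {
        "name": "singa-" + job_id,
        "command": {
            # Command taken from the description above.
            "value": "singa -conf ./job.conf",
            # Files the slave should download before running the command.
            "uris": [{"value": conf_uri}],
        },
    }


task = build_task("job-42", "namenode:9000")
print(task["command"]["uris"][0]["value"])
```

Using a separate directory per job keeps the configuration files of concurrent jobs from overwriting each other.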

The development and testing environment for this ticket is created from 
[SINGA-89|https://issues.apache.org/jira/browse/SINGA-89].

We will create a {{README.md}} file explaining the steps.

h5. Important
We assume that SINGA, Mesos and Hadoop are running on every node.

  was:
Apache Mesos is a fine-grained cluster management framework which enables 
resource sharing in the same cluster. Mesos abstracts out the physical 
configurations of cluster nodes, and presents resources to the users in the 
form of "offers". SINGA uses Mesos for two purposes:

1. To acquire necessary resources for training the model.
2. To launch and monitor progress of the training task.

To this end, we implement a "SINGA Scheduler" which interacts with Mesos 
master. The scheduler assumes that SINGA has been installed at the Mesos slave 
nodes. The scheduler is called when the user wants to start a new SINGA job, 
and it performs the following steps:

Step 1. Read the job configuration file to determine necessary resources in 
terms of CPUs, memory and storage.
Step 2. Wait for resource offers from the Mesos master.
Step 3. Determine if the offers meet the requirement of resources.
Step 4. Prepare the task to launch at each slave:
    + Deliver the job configuration file to the slave node.
    + Specify the command to run on the slave: "singa -conf ./job.conf"
Step 5. Launch and monitor the progress

For step 3, we currently implement a simple scheme: the number of CPUs offered 
by each Mesos slave must exceed the total number of SINGA workers and SINGA 
servers per process. In other words, each Mesos slave must be able to run the 
entire worker group or server group.

For step 4, we currently rely on HDFS to deliver the configuration file to 
each slave. Specifically, we write the file to a known directory (different for 
each job) on HDFS and ask the slave to use its Fetcher utility to download the 
file before executing the task.

We will create a README.md file explaining the steps.


> Start SINGA on Apache Mesos
> ---------------------------
>
>                 Key: SINGA-11
>                 URL: https://issues.apache.org/jira/browse/SINGA-11
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: wangwei
>            Assignee: Anh Dinh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
