[
https://issues.apache.org/jira/browse/SINGA-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anh Dinh updated SINGA-11:
--------------------------
Summary: Start SINGA on Apache Mesos (was: Start SINGA using Mesos)
> Start SINGA on Apache Mesos
> ---------------------------
>
> Key: SINGA-11
> URL: https://issues.apache.org/jira/browse/SINGA-11
> Project: Singa
> Issue Type: New Feature
> Reporter: wangwei
> Assignee: Anh Dinh
>
> Apache Mesos is a fine-grained cluster management framework that enables
> resource sharing within a cluster. Mesos abstracts away the physical
> configuration of the cluster nodes and presents resources to users in the
> form of "offers". SINGA uses Mesos for two purposes:
> 1. To acquire necessary resources for training the model.
> 2. To launch and monitor progress of the training task.
> To this end, we implement a "SINGA Scheduler" which interacts with the Mesos
> master. The scheduler assumes that SINGA has been installed on the Mesos
> slave nodes. The scheduler is called when the user wants to start a new
> SINGA job, and it performs the following steps:
> Step 1. Read the job configuration file to determine necessary resources in
> terms of CPUs, memory and storage.
> Step 2. Wait for resource offers from the Mesos master.
> Step 3. Determine if the offers meet the requirement of resources.
> Step 4. Prepare the task to launch at each slave:
> + Deliver the job configuration file to the slave node.
> + Specify the command to run on the slave: "singa -conf ./job.conf"
> Step 5. Launch the tasks and monitor their progress.
> For step 3, we currently implement a simple scheme: the number of CPUs
> offered by each Mesos slave must exceed the total number of SINGA workers
> and SINGA servers per process. In other words, each Mesos slave must be able
> to run the entire worker group or server group.
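
A minimal sketch of the step 3 check, for illustration only. The Offer class
and the function names are hypothetical stand-ins, not the Mesos API or the
SINGA Scheduler code, and the comparison follows the wording above literally
(strictly greater than); the real scheduler may use a different bound.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Hypothetical, simplified stand-in for a Mesos resource offer."""
    slave_id: str
    cpus: float


def offer_meets_requirement(offer, workers_per_process, servers_per_process):
    """Return True if the offer has enough CPUs for one SINGA process.

    Literal reading of the scheme: offered CPUs must exceed the total
    number of workers and servers per process.
    """
    required = workers_per_process + servers_per_process
    return offer.cpus > required


def usable_offers(offers, workers_per_process, servers_per_process):
    """Keep only the offers that pass the CPU check."""
    return [
        o for o in offers
        if offer_meets_requirement(o, workers_per_process, servers_per_process)
    ]


# Example: 4 workers and 2 servers per process need more than 6 CPUs.
offers = [Offer("slave-1", 8), Offer("slave-2", 2)]
print([o.slave_id for o in usable_offers(offers, 4, 2)])
```

Memory and storage from step 1 are left out here to keep the sketch focused on
the CPU rule described above.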
> For step 4, we currently rely on HDFS to deliver the configuration file to
> each slave. Specifically, we write the file to a known directory (different
> for each job) on HDFS and ask the slave to use the Mesos Fetcher utility to
> download the file before executing the task.
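
A rough sketch of how the step 4 task description could be assembled. It uses
a plain dictionary rather than the real Mesos task protobuf, and the HDFS base
path, job id, slave id and function name are all made up for illustration;
only the command string comes from the description above.

```python
def build_task(job_id, slave_id, hdfs_base):
    """Describe one task: which file to fetch and which command to run.

    hdfs_base is an assumed per-cluster directory; each job gets its own
    subdirectory so configuration files from different jobs do not collide.
    """
    conf_uri = hdfs_base.rstrip("/") + "/" + job_id + "/job.conf"
    return {
        "slave_id": slave_id,
        # URIs the slave downloads into the task sandbox before the task starts.
        "uris": [conf_uri],
        # The file lands in the sandbox as ./job.conf, so the path is relative.
        "command": "singa -conf ./job.conf",
    }


# Example with a hypothetical namenode address and job id.
task = build_task("job-42", "slave-1", "hdfs://namenode:9000/singa/jobs/")
print(task["uris"][0])
print(task["command"])
```

Keeping the per-job directory in the URI is what lets the same relative path
(./job.conf) be used in the command on every slave.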
> We will create a README.md file explaining the steps.