[ https://issues.apache.org/jira/browse/SINGA-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966270#comment-14966270 ]
ASF subversion and git services commented on SINGA-11:
------------------------------------------------------
Commit 5d076e526b8d78478ed969ee0c4c33febc8b9eee in incubator-singa's branch
refs/heads/master from [~flytosky]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=5d076e5 ]
SINGA-11 Start SINGA on Apache Mesos
Merge branch 'feature-mesos'
> Start SINGA on Apache Mesos
> ---------------------------
>
> Key: SINGA-11
> URL: https://issues.apache.org/jira/browse/SINGA-11
> Project: Singa
> Issue Type: New Feature
> Reporter: wangwei
> Assignee: Anh Dinh
>
> Apache Mesos is a fine-grained cluster management framework which enables
> resource sharing in the same cluster. Mesos abstracts out the physical
> configurations of cluster nodes, and presents resources to the users in the
> form of "offers". SINGA uses Mesos for two purposes:
> # To acquire necessary resources for training the model.
> # To launch and monitor progress of the training task.
> To these ends, we implement a {{SINGA Scheduler}} which interacts with the
> Mesos master. The scheduler is called when the user wants to start a new SINGA
> job, and it performs the following steps:
> # Read the job configuration file to determine necessary resources in terms
> of CPUs, memory and storage.
> # Wait for resource offers from the Mesos master.
> # Determine whether the offers meet the resource requirements.
> # Prepare the task to launch at each slave:
> #* Deliver the job configuration file to the slave node.
> #* Specify the command to run on the slave:
> {code}
> singa -conf ./job.conf
> {code}
> #* Launch the task and monitor its progress.
> For step 3, we currently implement a simple scheme: the number of CPUs
> offered by each Mesos slave must exceed the total number of SINGA workers and
> SINGA servers per process. In other words, each selected slave must be able to
> run the entire worker group or server group.
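The step 3 check can be sketched roughly as below. This is only an illustration, not the scheduler's actual code: the `Offer` class, the function names, and the idea of picking the first `nprocs` suitable offers are all hypothetical. It also treats an offer with exactly the required number of CPUs as sufficient, which is an assumption; the description says "exceed", so a strict comparison may be what is intended.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Hypothetical stand-in for a Mesos resource offer (CPUs only)."""
    slave_id: str
    cpus: float


def required_cpus_per_process(workers_per_process, servers_per_process):
    # Total SINGA workers plus SINGA servers that one process must host.
    return workers_per_process + servers_per_process


def select_offers(offers, cpus_needed, nprocs):
    """Pick nprocs offers that can each host a whole process.

    Returns an empty list when too few offers qualify, in which case the
    scheduler would keep waiting for further offers.
    """
    # Assumption: "at least cpus_needed" counts as enough; use > if strict.
    suitable = [o for o in offers if o.cpus >= cpus_needed]
    return suitable[:nprocs] if len(suitable) >= nprocs else []
```

With offers of 2, 4 and 8 CPUs and a requirement of 4 CPUs per process, the sketch selects the last two slaves for a two-process job and selects nothing for a three-process job.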
> For step 4, we currently rely on HDFS to deliver the configuration file to
> each slave. Specifically, we write the file to a known directory (different
> for each job) on HDFS and ask the slave to use its Fetcher utility to download
> the file before executing the task.
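A minimal sketch of how the per-job HDFS location and the task description for step 4 might fit together. Everything named here is an assumption made for illustration: the `TaskSpec` class, the `job_conf_uri` and `build_tasks` helpers, the namenode address, and the `singa/<job id>/job.conf` layout are hypothetical and do not come from the ticket or from the Mesos API. Only the command line is taken from the description.

```python
import posixpath
from dataclasses import dataclass
from typing import List


@dataclass
class TaskSpec:
    """Hypothetical description of one task to launch on a slave."""
    slave_id: str
    fetch_uris: List[str]   # files the slave's Fetcher should download first
    command: str            # command run once the downloads are in place


def job_conf_uri(namenode, base_dir, job_id, filename="job.conf"):
    # One directory per job, so concurrent jobs never share a config file.
    path = posixpath.join("/", base_dir, job_id, filename)
    return "hdfs://" + namenode + path


def build_tasks(slave_ids, conf_uri):
    # Every selected slave fetches the same file and runs the same command.
    return [
        TaskSpec(slave_id=s, fetch_uris=[conf_uri],
                 command="singa -conf ./job.conf")
        for s in slave_ids
    ]
```

For example, `job_conf_uri("namenode:9000", "singa", "job-42")` yields `hdfs://namenode:9000/singa/job-42/job.conf`, and `build_tasks` attaches that URI to each selected slave so the file is in the task's working directory before `singa` starts.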
> The development and testing environment for this ticket is created from
> [SINGA-89|https://issues.apache.org/jira/browse/SINGA-89].
> We will create a {{README.md}} file explaining the steps.
> h5. Important
> We assume that SINGA, Mesos and Hadoop are running on every node.