[jira] [Commented] (SINGA-48) Fix a bug in trainer.cc that assigns the same NeuralNet instance to workers from diff groups

ASF subversion and git services (JIRA) Wed, 12 Aug 2015 23:49:10 -0700

    [ https://issues.apache.org/jira/browse/SINGA-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694799#comment-14694799 ]


ASF subversion and git services commented on SINGA-48:
------------------------------------------------------

Commit 039de8b0ad481f53502af84296d2947464f9ad11 in incubator-singa's branch 
refs/heads/master from wang sheng
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=039de8b ]

SINGA-48 Fix a bug in trainer.cc that assigns the same NeuralNet instance to 
workers from diff groups

merge to master


> Fix a bug in trainer.cc that assigns the same NeuralNet instance to workers 
> from diff groups
> --------------------------------------------------------------------------------------------
>
>                 Key: SINGA-48
>                 URL: https://issues.apache.org/jira/browse/SINGA-48
>             Project: Singa
>          Issue Type: Bug
>            Reporter: wangwei
>
> In SINGA, workers from the same group and in the same process share the same 
> NeuralNet instance. Different worker groups should have different NeuralNet 
> objects. However, the current Trainer::SetupWorkerServer function assigns the 
> same NeuralNet instance to workers in different groups. Consequently, two 
> workers may compute on the same layer instance, which leads to repeated 
> calls of the ComputeFeature and ComputeGradient functions and causes run-time 
> errors.
> Another issue is that two workers from different groups that reside in 
> the same process would share memory for layer parameters. Memory sharing 
> is not a problem if the group size is 1. But if there is more than one 
> worker in a group, the workers must run synchronously. The synchronization 
> is controlled by the parameter version. If memory sharing is enabled, 
> workers from other groups may increase the parameter version, which leads 
> to errors in synchronization. To fix this issue, SINGA needs to disable 
> memory sharing among groups if the worker group size is greater than 1.
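
The intended behaviour described above can be sketched roughly as follows. This is only an illustration under assumptions, not the code in trainer.cc: NeuralNet and Worker are hypothetical stand-in structs, and SetupWorkers and ShareParamMemoryAcrossGroups are made-up helper names, not SINGA functions.

```cpp
#include <map>
#include <memory>
#include <vector>

// Hypothetical stand-in for SINGA's NeuralNet; only object identity matters.
struct NeuralNet {};

// Hypothetical worker record: its group, its id, and the net it computes on.
struct Worker {
  int group_id;
  int worker_id;
  std::shared_ptr<NeuralNet> net;
};

// Parameter memory may be shared across groups only when each group has a
// single worker; larger groups synchronize on the parameter version, which
// workers from other groups must not advance.
bool ShareParamMemoryAcrossGroups(int group_size) {
  return group_size <= 1;
}

// Create the workers of one process so that workers of the same group share
// one NeuralNet instance and workers of different groups never do.
std::vector<Worker> SetupWorkers(int num_groups, int group_size) {
  std::map<int, std::shared_ptr<NeuralNet>> net_of_group;
  std::vector<Worker> workers;
  for (int g = 0; g < num_groups; ++g) {
    auto& net = net_of_group[g];
    if (!net) net = std::make_shared<NeuralNet>();  // one net per group
    for (int w = 0; w < group_size; ++w)
      workers.push_back(Worker{g, w, net});
  }
  return workers;
}
```

Keying the nets by group id, rather than reusing one net for the whole process, is what keeps two groups from calling ComputeFeature and ComputeGradient on the same layer instance.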



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
