[ https://issues.apache.org/jira/browse/SINGA-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694799#comment-14694799 ]
ASF subversion and git services commented on SINGA-48:
------------------------------------------------------
Commit 039de8b0ad481f53502af84296d2947464f9ad11 in incubator-singa's branch
refs/heads/master from wang sheng
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=039de8b ]
SINGA-48 Fix a bug in trainer.cc that assigns the same NeuralNet instance to
workers from diff groups
merge to master
> Fix a bug in trainer.cc that assigns the same NeuralNet instance to workers
> from diff groups
> --------------------------------------------------------------------------------------------
>
> Key: SINGA-48
> URL: https://issues.apache.org/jira/browse/SINGA-48
> Project: Singa
> Issue Type: Bug
> Reporter: wangwei
>
> In SINGA, workers from the same group and in the same process share the same
> NeuralNet instance. Different worker groups should have different NeuralNet
> objects. However, the current Trainer::SetupWorkerServer function assigns the
> same NeuralNet instance to workers in different groups. Consequently, two
> workers may compute on the same layer instance, which would lead to repeated
> calls to the ComputeFeature and ComputeGradient functions and cause run-time
> errors.
> Another issue is that two workers from different groups that reside in the
> same process would share memory for layer parameters. Memory sharing causes
> no problem if the group size is 1. But if there is more than one worker in a
> group, the workers in that group should run synchronously. The
> synchronization is controlled by the parameter version. If memory sharing is
> enabled, workers from other groups may increase the parameter version, which
> leads to synchronization errors. To fix this issue, SINGA needs to disable
> memory sharing among groups if the worker group size is greater than 1.
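
For illustration only, the intended assignment rule from the description above can be sketched as follows. This is not the code from the commit: the NeuralNet and Worker structs, SetupWorkers, and CanShareParamMemory are hypothetical stand-ins, and the sketch assumes all workers live in a single process.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <vector>

// Hypothetical stand-in for SINGA's NeuralNet; only object identity and the
// memory-sharing flag matter for this sketch.
struct NeuralNet {
  int group_id;
  bool shares_param_memory;
};

// Hypothetical stand-in for a worker, identified by (group_id, worker_id).
struct Worker {
  int group_id;
  int worker_id;
  std::shared_ptr<NeuralNet> net;
};

// Parameter memory may be shared among groups only when each group has a
// single worker; otherwise another group could bump the parameter version
// that the workers inside a group synchronize on.
bool CanShareParamMemory(int group_size) { return group_size == 1; }

// Assumption: every worker is in the same process. Workers of the same group
// get the same NeuralNet instance; each group gets its own instance.
std::vector<Worker> SetupWorkers(int num_groups, int group_size) {
  std::map<int, std::shared_ptr<NeuralNet>> net_of_group;
  std::vector<Worker> workers;
  for (int g = 0; g < num_groups; ++g) {
    for (int w = 0; w < group_size; ++w) {
      std::shared_ptr<NeuralNet>& net = net_of_group[g];
      if (!net) {
        // First worker of this group: create the group's own net.
        net = std::make_shared<NeuralNet>(
            NeuralNet{g, CanShareParamMemory(group_size)});
      }
      workers.push_back(Worker{g, w, net});
    }
  }
  return workers;
}
```

With two groups of two workers each, workers 0 and 1 of a group would hold the same net pointer, workers from different groups would hold different pointers, and the memory-sharing flag would be off because the group size exceeds 1.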
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)