Note: Providing complete information in the most concise form is the best way to get help. This issue template serves as a checklist of the essential information for most technical issues and bug reports. For non-technical issues and feature requests, feel free to present the information in whatever form you believe is best.

For Q&A and discussion, please start a discussion thread at https://discuss.mxnet.io

## Description
Training with multiple GPUs is much slower than with a single GPU.

## Environment info (Required)
- Ubuntu 16.04
- Python 2.7
- Latest MXNet
- 8 × V100
- I use the symbol API, not Gluon




## Details
Actually, I need to train an FM (factorization machine) model for a recommender system. I don't follow the sparse example in MXNet; I use an embedding layer instead.

With one V100 and batch_size=8192, the speed is 400k+ samples per second. If I switch to 8 V100s with the same global batch_size=8192, so that every V100 gets batch_size=1024, I only get 18k+ samples per second. In other words, 8 V100s are much slower than one V100. I don't think the bottleneck is data I/O, because I can reach 400k+ samples per second with a single V100.
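For reference, this is roughly how I create the module in the two runs (a minimal sketch, not my exact script; `fm_symbol` stands for the symbol pasted at the end of this issue):

    import mxnet as mx

    # single-GPU run: batch_size=8192 on one V100 -> 400k+ samples/s
    mod_single = mx.mod.Module(symbol=fm_symbol,
                               data_names=['feat_index', 'feat_value'],
                               label_names=['label'],
                               context=mx.gpu(0))

    # multi-GPU run: the Module splits the same global batch_size=8192
    # into 1024 per GPU across 8 V100s -> only 18k+ samples/s
    mod_multi = mx.mod.Module(symbol=fm_symbol,
                              data_names=['feat_index', 'feat_value'],
                              label_names=['label'],
                              context=[mx.gpu(i) for i in range(8)])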

I also stacked 40 fully connected layers to fit data generated by y = ax + b; with the same batch size, 2 GPUs are also slower than 1 GPU (see the sketch below).
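A minimal sketch of that toy experiment (the hidden size, constants, and data shapes here are placeholders, not the exact values I used):

    import mxnet as mx
    import numpy as np

    # toy data generated by y = a*x + b (a=3.0, b=0.5 as placeholders)
    x = np.random.uniform(-1, 1, (81920, 1)).astype('float32')
    y = 3.0 * x + 0.5

    net = mx.symbol.Variable('data')
    for i in range(40):  # 40 stacked fully connected layers
        net = mx.symbol.FullyConnected(net, num_hidden=64, name='fc%d' % i)
    net = mx.symbol.FullyConnected(net, num_hidden=1, name='fc_out')
    net = mx.symbol.LinearRegressionOutput(net, mx.symbol.Variable('label'))

    train_iter = mx.io.NDArrayIter(x, y, batch_size=8192, label_name='label')
    # switching context between [mx.gpu(0)] and [mx.gpu(0), mx.gpu(1)]
    # with the same global batch size reproduces the slowdown
    mod = mx.mod.Module(net, label_names=['label'],
                        context=[mx.gpu(0), mx.gpu(1)])
    mod.fit(train_iter, num_epoch=10, optimizer='sgd')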


So what is the reason for this? I would appreciate it if anyone could give any advice!

Below is my symbol (from inside my model class):

        self.feat_index = mx.symbol.Variable(name="feat_index") #batch * F
        self.feat_value = mx.symbol.Variable(name="feat_value") #batch * F
        self.label = mx.symbol.Variable(name="label")
        
        self.weights = self._initialize_weights()
        

        self.embeddings = mx.symbol.Embedding(self.feat_index,
                                              input_dim=self.feature_size,
                                              output_dim=self.embedding_size,
                                              name="embed1")

        feat_value = mx.symbol.reshape(self.feat_value,
                                       [-1, self.field_size, 1])  # batch * F * 1
        self.embeddings = mx.symbol.broadcast_mul(self.embeddings, feat_value)

        # ---------- first order term ----------

        self.y_first_order = mx.symbol.Embedding(self.feat_index,
                                                 input_dim=self.feature_size,
                                                 output_dim=1, name="embed2")
        self.y_first_order = mx.symbol.sum(
            mx.symbol.elemwise_mul(self.y_first_order, feat_value), 2)  # None * F

        # ---------- second order term ----------

        self.summed_features_emb = mx.symbol.sum(self.embeddings, 1)  # None * K
        self.summed_features_emb_square = mx.symbol.square(
            self.summed_features_emb)  # None * K

        # square_sum part
        self.squared_features_emb = mx.symbol.square(self.embeddings)
        self.squared_sum_features_emb = mx.symbol.sum(
            self.squared_features_emb, 1)  # None * K

        self.y_second_order = 0.5 * mx.symbol.elemwise_sub(
            self.summed_features_emb_square,
            self.squared_sum_features_emb)  # None * K

        self.concat_input = mx.symbol.concat(self.y_first_order,
                                             self.y_second_order, dim=1)
        self.out = mx.symbol.sum(
            mx.symbol.broadcast_add(
                mx.symbol.dot(self.concat_input,
                              self.weights["concat_projection"]),
                self.weights["concat_bias"]), 1)
        
        self.model = mx.symbol.LogisticRegressionOutput(self.out, self.label)
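And this is roughly how I feed it (a sketch with random placeholder data; `fm` stands for an instance of my model class, and the `field_size`/`feature_size` values are placeholders, since my real pipeline reads from files):

    import mxnet as mx
    import numpy as np

    batch_size = 8192
    field_size = 39          # placeholder
    feature_size = 1000000   # placeholder

    n = batch_size * 100
    feat_index = np.random.randint(0, feature_size, (n, field_size))
    feat_value = np.random.uniform(0, 1, (n, field_size)).astype('float32')
    label = np.random.randint(0, 2, (n,)).astype('float32')

    train_iter = mx.io.NDArrayIter(
        data={'feat_index': feat_index, 'feat_value': feat_value},
        label={'label': label},
        batch_size=batch_size)

    mod = mx.mod.Module(fm.model,  # the symbol built above
                        data_names=['feat_index', 'feat_value'],
                        label_names=['label'],
                        context=[mx.gpu(i) for i in range(8)])
    mod.fit(train_iter, num_epoch=1, optimizer='adam')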
