Hi @ChaiBapchya, thank you very much for your reply.
The model is a building change detection model (similar to a Siamese UNet) with
semantic segmentation as the output. The building blocks are unfortunately
complicated, in the sense of a complicated computation graph (and complication
is bad :( - I apologize for that). I am a few weeks away from submitting for
publication and doing some final tests; I will be able to share the code afterwards.
The problem is model-dependent; everything works fine with standard models.
Also, it seems the problem is not Horovod-dependent, because even a standard
classification model (outside Horovod) takes a few minutes to launch, whereas a
network with an identical backbone built from ResNet building blocks launches
almost immediately. I did this test yesterday.
I just exported the model into a JSON file (115733 lines). I don't know if it
gives more insight, but the file ends with:
```
],
"heads": [[15266, 0, 0], [15231, 0, 0], [15203, 0, 0], [15270, 0, 0]],
"attrs": {"mxnet_version": ["int", 10600]}
}
```
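For what it's worth, something along these lines (the file name is just a
placeholder for my exported symbol file) can be used to get a rough idea of the
graph size from that JSON:

```python
import json

# Placeholder file name for the exported symbol graph.
with open("model-symbol.json") as f:
    graph = json.load(f)

# The symbol JSON stores the full node list plus the output heads and attrs
# shown above, so the node count gives a rough measure of graph complexity.
print("number of nodes:", len(graph["nodes"]))
print("heads:", graph["heads"])
print("attrs:", graph["attrs"])
```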
The environment is a local HPC environment; I do my debugging tests by requesting
2 GPUs (P100), 12 processors (Xeon) per process, and 128 GB of memory. It seems
these models require a lot of CPU memory as well. MXNet version:
```cu101-1.6.0.dist-info```, CUDA 10.1.168.
I can provide full system info, but I think the question is: can I load the model
from a file / from memory to avoid going through this operation every time? I
think it is a GPU issue; when I load the models on CPU they fire up almost instantly.
I also get the following warning; I don't know if it is relevant:
```
In [9]: outs = config['net'](xx,xx)
[17:40:48] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.
```
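Related to that warning: as far as I understand, pointwise fusion is a GPU feature
introduced in MXNet 1.6, so one test I am thinking of (assuming the
`MXNET_USE_FUSION` environment variable is the right switch for it) is to disable
fusion and see whether the start-up time changes:

```python
import os

# Assumption: MXNET_USE_FUSION controls the pointwise fusion added in MXNet 1.6.
# It has to be set before mxnet is imported.
os.environ["MXNET_USE_FUSION"] = "0"

import mxnet as mx
```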
Again, thank you very much for your time.
Regards