Hi,
I am trying to run distributed parameter server training in TensorFlow with
one parameter server and 5 workers.
I am training a model on the MNIST dataset with an SGD optimizer. The workers
fail with the following error:
2022-04-18 11:08:51.638470: E tensorflow/core/common_runtime/eager/context_distributed_manager.cc:486] Connection reset by peer
Additional GRPC error information from remote target /job:ps/replica:0/task:0:
:{"created":"@1650301731.638226382","description":"Error received from peer ipv4:192.168.1.1:12341","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}
E0418 11:08:51.639363565 31165 completion_queue.cc:244] assertion failed: queue.num_items() == 0
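For reference, the cluster layout described above (one parameter server, five workers) can be sketched as the TF_CONFIG environment variable that TensorFlow reads on each process. This is only an illustration under stated assumptions: the parameter server address 192.168.1.1:12341 is taken from the log, while the worker hosts, the worker port, and the helper name make_tf_config are hypothetical placeholders and not the actual setup.

```python
import json
import os

# Parameter server address as it appears in the error log.
PS_ADDRESS = "192.168.1.1:12341"

# Hypothetical worker addresses (assumption): five workers on made-up hosts.
WORKER_ADDRESSES = [f"192.168.1.{host}:12342" for host in range(2, 7)]


def make_tf_config(task_type: str, task_index: int) -> str:
    """Build a TF_CONFIG JSON string for one process in the cluster.

    make_tf_config is an illustrative helper, not part of TensorFlow.
    """
    cluster = {"ps": [PS_ADDRESS], "worker": WORKER_ADDRESSES}
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type!r}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"task index {task_index} out of range for {task_type!r}")
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


if __name__ == "__main__":
    # Example: the process acting as worker 0 would export this value
    # before TensorFlow is started.
    os.environ["TF_CONFIG"] = make_tf_config("worker", 0)
    print(os.environ["TF_CONFIG"])
```

Each process would set TF_CONFIG with its own task type and index, and every process needs to see the same cluster dictionary, so that the address each worker dials for /job:ps/replica:0/task:0 is the one the parameter server is actually listening on.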
Can you please help me with this?
Thank you,
Paridhika