On Sun, Dec 22, 2019 at 3:33 PM Raymond, David <[email protected]>
wrote:
> I am running openmpi-4.0.2 (self-compiled with GDS patches) on
> up-to-date 6.6 stable with a Go program that calls Clang MPI routines.
> With particular hardware (details provided if desired), readv and
> writev calls randomly fail with respectively "Timeout" and "Permission
> denied" errors for calls from one machine to another across the
> ethernet.
While "Permission denied" is the error message for EACCES, "Timeout" is not
a complete errno error message on OpenBSD. Has it been established, using
ktrace/kdump, which errors the underlying readv/writev syscalls are
actually returning?
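If the calling code is under your control, a minimal sketch along these
lines would also show the raw errno next to its strerror(3) text at the
point of failure, which helps tell a real syscall error from a message
invented by a library layer. The names describe_errno and checked_readv
are hypothetical helpers made up for illustration; they are not part of
Open MPI or libc, and this is no substitute for the ktrace/kdump output.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: format an errno value as "errno=<number> (<message>)". */
static const char *
describe_errno(int err, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "errno=%d (%s)", err, strerror(err));
	return buf;
}

/*
 * Hypothetical wrapper: call readv(2) and, on failure, log the raw errno
 * to stderr. errno is saved first because stdio calls may overwrite it,
 * and restored afterwards so the caller still sees the original value.
 */
static ssize_t
checked_readv(int fd, const struct iovec *iov, int iovcnt)
{
	ssize_t n = readv(fd, iov, iovcnt);

	if (n == -1) {
		int saved = errno;
		char msg[128];

		fprintf(stderr, "readv(fd=%d): %s\n", fd,
		    describe_errno(saved, msg, sizeof(msg)));
		errno = saved;
	}
	return n;
}
```

Calling checked_readv() in place of readv() at the failing call site
(and an equivalent wrapper around writev()) should make it clear whether
the numeric errno really corresponds to the strings being reported.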
Next: if you have a device open, then the device driver *totally controls*
what errnos syscalls get. If a device driver wanted to return EDOM
("Numerical argument out of domain") it totally could. If you're getting
a weird errno from a device, well, review the device source!
> The errors don't occur between cores on the same machine.
>
THIS SHOULD NOT BE A SURPRISE: the net is not the same as your local
machine.
> The man pages for readv and writev don't document the possibility of
> such errors.
IMO, weird errnos from devices should be documented in the manpage for the
device. Consider the termios(4) manpage, for example.
Philip Guenther