Now that the Hadoop native code builds on Solaris I've been chipping
away at all the test failures. About 50% of the failures involve
DomainSocket, either directly or indirectly. That seems to be mainly
because the tests use DomainSocket to do single-node testing, whereas in
production it seems that DomainSocket is less commonly used
(https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html).

The particular problem on Solaris is that socket read/write timeouts
(the SO_SNDTIMEO and SO_RCVTIMEO socket options) are not supported for
UNIX domain (PF_UNIX) sockets. Those options are however supported for
PF_INET sockets. That's because the socket implementation on Solaris is
split roughly into two parts, for inet sockets and for STREAMS sockets,
and the STREAMS implementation lacks support for SO_SNDTIMEO and
SO_RCVTIMEO. As an aside, performance of sockets that use loopback or
the host's own IP is slightly better than that of UNIX domain sockets on
Solaris.

I'm investigating getting timeout support for PF_UNIX sockets added
to Solaris, but in the meantime I'm also looking at how this might be
worked around in Hadoop. One way would be to implement timeouts by
wrapping all the read/write/send/recv etc calls in DomainSocket.c with
either poll() or select().
The basic idea is to add two new fields to DomainSocket.c to hold the
read/write timeouts. On platforms that support SO_SNDTIMEO and
SO_RCVTIMEO these would be unused as setsockopt() would be used to set
the socket timeouts. On platforms such as Solaris the JNI code would use
the values to implement the timeouts appropriately.
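
One possible shape for this, as a sketch only: struct hd_sock and
hd_set_rcv_timeout are hypothetical names, and the struct is not the
layout DomainSocket.c actually uses. The sketch also decides at run
time, by treating any setsockopt() failure as "not supported", whereas
the proposal above could equally make the choice per platform at build
time.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical per-socket state holding the receive timeout. */
struct hd_sock {
    int fd;
    int rcv_timeout_ms;  /* 0 = no timeout; used only when emulating */
    int emulate_rcv;     /* 1 if SO_RCVTIMEO could not be set */
};

/*
 * Record the timeout, then try to hand it to the kernel. If that
 * fails, the I/O wrappers are expected to enforce rcv_timeout_ms
 * themselves (for example with poll()).
 */
static void hd_set_rcv_timeout(struct hd_sock *s, int ms)
{
    struct timeval tv;

    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;

    s->rcv_timeout_ms = ms;
    s->emulate_rcv = (setsockopt(s->fd, SOL_SOCKET, SO_RCVTIMEO,
                                 &tv, sizeof(tv)) != 0);
}
```

The send side would mirror this with a second field and SO_SNDTIMEO.
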
To prevent the code in DomainSocket.c becoming a #ifdef hairball, the
current socket I/O function calls such as accept(), send(), read() etc.
would be replaced with macros such as HD_ACCEPT. On platforms that
provide timeouts these would just expand to the normal socket functions;
on platforms that don't support timeouts they would expand to wrappers
that implement the timeouts.
The only caveat is that all code that does anything to a PF_UNIX
socket would *always* have to do so via DomainSocket. As far as I can
tell that's not an issue, but it would have to be borne in mind if any
changes were made in this area.

Before I set about doing this, does the approach seem reasonable?

Thanks,
--
Alan Burlison
--