> On July 1, 2015, 7:20 a.m., Adam B wrote: > > I'm sorry, but I don't understand how LOG(FATAL) was segfaulting before. > > Please explain.
Hi Adam, This was reported as a segfault by a user in IRC. The specific segfault was fixed by MESOS-2636. The segfault _was_ happening under an error condition, so after the segfault was fixed, we would _still_ exit with a SIGABRT / stack trace that is generated by a `LOG(FATAL)`. There has been a drive to reduce the number of bad user experiences by converting some of the exit points that SIGABRT / stack trace when just a friendly message to the user would be much more helpful. I acknowledge that `PLOG(FATAL)` already prints out an error message; however, it is rather hidden amongst the other messages and stack trace. Users tend to miss this and complain that Mesos is broken, rather than fixing the user / configuration problem mentioned in the error message. Does this answer your question? > On July 1, 2015, 7:20 a.m., Adam B wrote: > > 3rdparty/libprocess/src/process.cpp, line 899 > > <https://reviews.apache.org/r/36061/diff/2/?file=996273#file996273line899> > > > > I'm confused about how this was segfaulting before. "Logging a FATAL > > message terminates the program (after the message is logged)", so unless > > `ip` is invalid or `ip.error()` doesn't exist (yet `ip.isError()==true`), > > the process should exit cleanly after logging the message. > > > > The original ticket (MESOS-2636) discusses segfaults within the actual > > `getIP()` and `hostname()` methods, so the process would have segfaulted > > before reaching the LOG(FATAL). Now that MESOS-2636 has been fixed, we > > should safely get back an `ip.error()` that we can actually log. Please see my top-level comment. This ticket is confusing because it's a mix of people running old code and carrying the terminology into a rebased patch. This ticket is about cleaning up the communication of the error message, not fixing the segfault. > On July 1, 2015, 7:20 a.m., Adam B wrote: > > 3rdparty/libprocess/src/process.cpp, lines 890-891 > > <https://reviews.apache.org/r/36061/diff/2/?file=996273#file996273line890> > > > > Why doesn't this need to be an `EXIT_(EXIT_FAILURE)` as well? We can consider that as well. The main difference is that `net::getIP()` is returning a `Try<net::IP>`, whereas gethostname is still a raw system call. In an ideal world we would be more sure that something that returns a `Try<T>` for a function like this would only error out if there really was a useful error message for the user and otherwise abort if there was a programmer error. - Joris ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36061/#review90014 ----------------------------------------------------------- On July 1, 2015, 12:32 a.m., Marco Massenzio wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/36061/ > ----------------------------------------------------------- > > (Updated July 1, 2015, 12:32 a.m.) > > > Review request for mesos, Adam B and Joris Van Remoortere. > > > Bugs: MESOS-2962 > https://issues.apache.org/jira/browse/MESOS-2962 > > > Repository: mesos > > > Description > ------- > > Jira: MESOS-2962 > > Slave fails with Abort stacktrace when DNS cannot resolve hostname > > If the DNS cannot resolve the hostname for a slave node, we correctly return > an Error object, but we then fail with a segfault. > > This code adds a more user-friendly message and exits normally (with an > `EXIT_FAILURE` code). > For example, forcing `net::getIp()` to always return an Error, now causes the > slave to exit like this: > ``` > $ ./bin/mesos-slave.sh --master=10.10.1.121:5405 > WARNING: Logging before InitGoogleLogging() is written to STDERR > E0630 11:31:45.777465 1944417024 process.cpp:899] Could not obtain the IP > address for stratos.local; the DNS service may not be able to resolve it: >>> > Marco was here!!! > > $ echo $? > 1 > ``` > > > Diffs > ----- > > 3rdparty/libprocess/src/process.cpp > d99947c1598c43c47c88ef3e8038081855f0d1dc > > Diff: https://reviews.apache.org/r/36061/diff/ > > > Testing > ------- > > make check > and manual failing the DNS > > > Thanks, > > Marco Massenzio > >
