> On July 1, 2015, 7:20 a.m., Adam B wrote: > > I'm sorry, but I don't understand how LOG(FATAL) was segfaulting before. > > Please explain. > > Joris Van Remoortere wrote: > Hi Adam, > This was reported as a segfault by a user in IRC. The specific segfault > was fixed by MESOS-2636. > The segfault _was_ happening under an error condition, so after the > segfault was fixed, we would _still_ exit with a SIGABRT / stack trace that > is generated by a `LOG(FATAL)`. > > There has been a drive to reduce the number of bad user experiences by > converting some of the exit points that SIGABRT / stack trace when just a > friendly message to the user would be much more helpful. > I acknowledge that `PLOG(FATAL)` already prints out an error message; > however, it is rather hidden amongst the other messages and stack trace. > Users tend to miss this and complain that Mesos is broken, rather than fixing > the user / configuration problem mentioned in the error message. > > Does this answer your question? > > Adam B wrote: > That helps. Could somebody paste an example log/stacktrace (and exit > code) from the current behavior (without this patch), for comparison? > While I agree that we should log something more helpful than > `ip.error()`, I'm not convinced that we need to convert this (or every?) > LOG(FATAL) to an EXIT().
Apologies for being 'economical with words' - I was trying to get this in fast as I knew you were cutting an RC yesterday, and this felt minor/insignificant enough not to warrant too much discussion. Equally, I don't see how this change should be controversial: the outcome is the same (program exits at the point of failure) but with a more user-friendly message, that would enable user to do some self-help debugging, instead of claiming on IRC/dev-list that "Mesos Slave crashes on AWS" (this was the original report phrased) Note that the ip.error() message originally logged is usually a complete non-starter (I think in this case, it only emits a "Name or service not known"). Original message: ``` $ ./mesos-slave.sh --master=127.0.0.1:5050 WARNING: Logging before InitGoogleLogging() is written to STDERR F0630 13:04:11.598757 30555 process.cpp:866] Name or service not known *** Check failure stack trace: *** Aborted (core dumped) ``` > On July 1, 2015, 7:20 a.m., Adam B wrote: > > 3rdparty/libprocess/src/process.cpp, lines 890-891 > > <https://reviews.apache.org/r/36061/diff/2/?file=996273#file996273line890> > > > > Why doesn't this need to be an `EXIT_(EXIT_FAILURE)` as well? > > Joris Van Remoortere wrote: > We can consider that as well. The main difference is that `net::getIP()` > is returning a `Try<net::IP>`, whereas gethostname is still a raw system call. > In an ideal world we would be more sure that something that returns a > `Try<T>` for a function like this would only error out if there really was a > useful error message for the user and otherwise abort if there was a > programmer error. wouldn't this be out-of-scope of this ticket? we can sure create a Jira for this too (tech-debt label, perhaps?) - Marco ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36061/#review90014 ----------------------------------------------------------- On July 1, 2015, 12:32 a.m., Marco Massenzio wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/36061/ > ----------------------------------------------------------- > > (Updated July 1, 2015, 12:32 a.m.) > > > Review request for mesos, Adam B and Joris Van Remoortere. > > > Bugs: MESOS-2962 > https://issues.apache.org/jira/browse/MESOS-2962 > > > Repository: mesos > > > Description > ------- > > Jira: MESOS-2962 > > Slave fails with Abort stacktrace when DNS cannot resolve hostname > > If the DNS cannot resolve the hostname for a slave node, we correctly return > an Error object, but we then fail with a segfault. > > This code adds a more user-friendly message and exits normally (with an > `EXIT_FAILURE` code). > For example, forcing `net::getIp()` to always return an Error, now causes the > slave to exit like this: > ``` > $ ./bin/mesos-slave.sh --master=10.10.1.121:5405 > WARNING: Logging before InitGoogleLogging() is written to STDERR > E0630 11:31:45.777465 1944417024 process.cpp:899] Could not obtain the IP > address for stratos.local; the DNS service may not be able to resolve it: >>> > Marco was here!!! > > $ echo $? > 1 > ``` > > > Diffs > ----- > > 3rdparty/libprocess/src/process.cpp > d99947c1598c43c47c88ef3e8038081855f0d1dc > > Diff: https://reviews.apache.org/r/36061/diff/ > > > Testing > ------- > > make check > and manual failing the DNS > > > Thanks, > > Marco Massenzio > >