> On July 1, 2015, 7:20 a.m., Adam B wrote:
> > I'm sorry, but I don't understand how LOG(FATAL) was segfaulting before. 
> > Please explain.
> 
> Joris Van Remoortere wrote:
>     Hi Adam,
>     This was reported as a segfault by a user in IRC. The specific segfault 
> was fixed by MESOS-2636.
>     The segfault _was_ happening under an error condition, so after the 
> segfault was fixed, we would _still_ exit with a SIGABRT / stack trace that 
> is generated by a `LOG(FATAL)`.
>     
>     There has been a drive to reduce the number of bad user experiences by 
> converting some of the exit points that SIGABRT / stack trace when just a 
> friendly message to the user would be much more helpful.
>     I acknowledge that `PLOG(FATAL)` already prints out an error message; 
> however, it is rather hidden amongst the other messages and stack trace. 
> Users tend to miss this and complain that Mesos is broken, rather than fixing 
> the user / configuration problem mentioned in the error message.
>     
>     Does this answer your question?
> 
> Adam B wrote:
>     That helps. Could somebody paste an example log/stacktrace (and exit 
> code) from the current behavior (without this patch), for comparison?
>     While I agree that we should log something more helpful than 
> `ip.error()`, I'm not convinced that we need to convert this (or every?) 
> LOG(FATAL) to an EXIT().

Apologies for being 'economical with words' - I was trying to get this in fast 
as I knew you were cutting an RC yesterday, and this felt minor/insignificant 
enough not to warrant too much discussion.

Equally, I don't see how this change should be controversial: the outcome is 
the same (program exits at the point of failure) but with a more user-friendly 
message, that would enable user to do some self-help debugging, instead of 
claiming on IRC/dev-list that "Mesos Slave crashes on AWS" (this was the 
original report phrased)
Note that the ip.error() message originally logged is usually a complete 
non-starter (I think in this case, it only emits a "Name or service not known").

Original message:
```
$ ./mesos-slave.sh --master=127.0.0.1:5050

WARNING: Logging before InitGoogleLogging() is written to STDERR

F0630 13:04:11.598757 30555 process.cpp:866] Name or service not known

*** Check failure stack trace: ***

Aborted (core dumped)
```


> On July 1, 2015, 7:20 a.m., Adam B wrote:
> > 3rdparty/libprocess/src/process.cpp, lines 890-891
> > <https://reviews.apache.org/r/36061/diff/2/?file=996273#file996273line890>
> >
> >     Why doesn't this need to be an `EXIT_(EXIT_FAILURE)` as well?
> 
> Joris Van Remoortere wrote:
>     We can consider that as well. The main difference is that `net::getIP()` 
> is returning a `Try<net::IP>`, whereas gethostname is still a raw system call.
>     In an ideal world we would be more sure that something that returns a 
> `Try<T>` for a function like this would only error out if there really was a 
> useful error message for the user and otherwise abort if there was a 
> programmer error.

wouldn't this be out-of-scope of this ticket?
we can sure create a Jira for this too (tech-debt label, perhaps?)


- Marco


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36061/#review90014
-----------------------------------------------------------


On July 1, 2015, 12:32 a.m., Marco Massenzio wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36061/
> -----------------------------------------------------------
> 
> (Updated July 1, 2015, 12:32 a.m.)
> 
> 
> Review request for mesos, Adam B and Joris Van Remoortere.
> 
> 
> Bugs: MESOS-2962
>     https://issues.apache.org/jira/browse/MESOS-2962
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Jira: MESOS-2962
> 
> Slave fails with Abort stacktrace when DNS cannot resolve hostname
> 
> If the DNS cannot resolve the hostname for a slave node, we correctly return 
> an Error object, but we then fail with a segfault.
> 
> This code adds a more user-friendly message and exits normally (with an 
> `EXIT_FAILURE` code).
> For example, forcing `net::getIp()` to always return an Error, now causes the 
> slave to exit like this:
> ```
> $ ./bin/mesos-slave.sh --master=10.10.1.121:5405
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> E0630 11:31:45.777465 1944417024 process.cpp:899] Could not obtain the IP 
> address for stratos.local; the DNS service may not be able to resolve it: >>> 
> Marco was here!!!
> 
> $ echo $?
> 1
> ```
> 
> 
> Diffs
> -----
> 
>   3rdparty/libprocess/src/process.cpp 
> d99947c1598c43c47c88ef3e8038081855f0d1dc 
> 
> Diff: https://reviews.apache.org/r/36061/diff/
> 
> 
> Testing
> -------
> 
> make check
> and manual failing the DNS
> 
> 
> Thanks,
> 
> Marco Massenzio
> 
>

Reply via email to