Hi Benjamin!

On Thu, Jan 03, 2019 at 11:09:55AM -0800, Benjamin Redelings wrote:
...
>1. The error message indicates that one of the MCMC moves computes three
>probabilities as NaNs, and then tries to sample from that weighted set,
>which throws an exception that prints a backtrace.
>
>2. The question is what is yielding a NaN.
>
>3. I'm surprised that all three numbers are NaN. This suggests that the
>current probability was already NaN before the function was called. In that
>case running with '-V4' to enable extra logging might show where the NaN
>originates.
>
>4. The stack trace indicates that the probabilities that are NaNs are coming
>from sample_SPR_search_one(Parameters&, MCMC::MoveStats&, tree_edge const&,
>std::map<tree_edge, bool, std::less<tree_edge>,
>std::allocator<std::pair<tree_edge const, bool> > > const&, bool)+0x513
>
>This is line 1281 "C = choose_MH(0, PrL)" of src/mcmc/sample-topology-SPR.cc
>
>5. The build log from arm-arm-01 seems to be getting NaNs in the same
>function, although they are -nan instead of +nan. This is only a sample size
>of 2, but suggests the problem occurs mostly in that particular function.
>
>6. The fact that it takes 7 seconds to crash, and the fact that the previous
>test succeeds, suggest that this error occurs only after several iterations.
>So for most inputs, no NaN is generated.
>
>7. Since armel does not crash, it looks like there might be a difference in how
>IEEE math errors are handled between armel and armhf. So, the floating point
>emulation code is not exactly the same as the hardware implementation. Does
>that sound possible?

Totally, yes. I *believe* the ARMv5 and ARMv7 configurations differ
here, but I'll be honest, I'm hazy on the exact details.

>On the other hand, it's possible that armel would crash too if you reran it,
>since the test uses random numbers. However, since there are no errors on
>x86 that I can find, this lends some weight to armel also being fine.

OK.

>8. This makes me wonder what happens if the -ffast-math flag is removed from
>this line in src/meson.build:
>
>add_project_arguments(['-DNDEBUG','-DNDEBUG_DP','-O3','-funroll-loops','-ffast-math'],
>language : 'cpp')
>
>It could be that armel and armhf differ in how they handle math errors when
>told to ignore NaN and Inf.
>
>9. We might be able to find out where the error is happening by changing the
>line
>
>    feclearexcept(FE_DIVBYZERO|FE_OVERFLOW|FE_INVALID);
>
>in `src/bali-phy.cc`. If we change this to just feclearexcept(FE_INVALID);
>then I think we'll find the first NaN when it gets generated. But we might
>need to run this inside gdb to find out where that occurs.
>
>I hope this detailed response is helpful... if I could reproduce the error
>that would make it easier to fix.
>
>I don't have any arm hardware though. How do you typically handle cases like
>this?

I'm more than happy to give out access to one of my machines to help
you fix this. Contact me off-list and we can set that up if you like.

-- 
Steve McIntyre, Cambridge, UK.                                [email protected]
Welcome my son, welcome to the machine.
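
For reference, here is a minimal sketch of the idea in point 9: using the <cfenv> exception flags to tell whether a NaN was produced by a particular step. It polls FE_INVALID with fetestexcept() instead of trapping in gdb, so it is a stand-in technique and not what bali-phy does. The names Normalized and normalize_checked are hypothetical and do not come from the bali-phy sources. It also assumes a build without -ffast-math, since that flag lets the compiler ignore NaNs and floating-point exception state.

```cpp
#include <cfenv>
#include <cmath>
#include <vector>

// Hypothetical result type: the normalised weights plus some diagnostics.
struct Normalized {
    std::vector<double> probs;  // weights divided by their sum
    bool raised_invalid;        // true if FE_INVALID was set while normalising
    int nan_count;              // number of NaN entries in probs
};

// Hypothetical helper: normalise a weight vector and report whether the
// floating-point "invalid operation" flag was raised along the way.
Normalized normalize_checked(const std::vector<double>& weights)
{
    // Start from a clean slate so a stale flag is not blamed on this step.
    std::feclearexcept(FE_INVALID);

    // volatile keeps the arithmetic from being folded away at compile time.
    volatile double total = 0.0;
    for (double w : weights)
        total = total + w;

    Normalized result;
    result.nan_count = 0;
    for (double w : weights) {
        double p = w / total;  // 0.0 / 0.0 gives NaN and raises FE_INVALID
        if (std::isnan(p))
            ++result.nan_count;
        result.probs.push_back(p);
    }

    // Poll the flag: non-zero means an invalid operation occurred above.
    result.raised_invalid = std::fetestexcept(FE_INVALID) != 0;
    return result;
}
```

Calling normalize_checked({0.0, 0.0, 0.0}) should report raised_invalid as true on hardware that implements the IEEE exception flags. A soft-float target may not set the flags at all, which would be one way the two ARM ports could behave differently.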

