On Fri, Nov 30, 2018 at 11:34 PM Waldek Kozaczuk <[email protected]> wrote:
> I think I have found the culprit! > If you remember, I suggested in the past that if the number being printed had been a NaN, the loop in question might have gone on forever. However, IIRC the code checks for NaN and infinity before entering the loop. This is why I suggested maybe an FPU corruption might have changed some good number into a NaN. Now you're suggesting, maybe the original number passed to printf() is somehow invalid, and the printf() code doesn't recognize it as special and uses it in the loop causing the loop to never end. You should be able to *directly* test this hypotesis - pick an "invalid" number, pass it to printf(), and see if runs into the same loop (and crash) we are chasing. But what "invalid" number to pick? I don't know, I'm not an expert enough in the floating point standard. I'd start with an sNaN (signalling NaN), and if that doesn't help, take the same floating point number which crashed the http example. > Related to what I am pointing out here - > https://groups.google.com/forum/#!topic/osv-dev/-7vIPwOJsyg - I thought > about enabling some FPU exceptions which are disabled by default. Once I > changed 0x037f to 0x037e which enables invalid operand exception and ran > my test and got this exception: > > [registers] fps=0.0 q=-0.0 size=N/A time=00:00:14.00 bitrate=N/A dup=0 > drop=273 speed= 28x > RIP: 0x000000000044fb28 <???+4520744> > RFL: 0x0000000000010206 CS: 0x0000000000000008 SS: 0x0000000000000010 > RAX: 0x0000000000000040 RBX: 0x00001000000faa22 RCX: 0x00002000001fd960 > RDX: 0x00002000001fd618 > RSI: 0x000000000000000a RDI: 0x00002000001fd580 RBP: 0x00002000001fd5f0 > R8: 0x0000000000000000 > R9: 0x0000000000000000 R10: 0x0000000000000000 R11: 0x00000000ffffffff > R12: 0x0000000000000000 > R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x00000000ffffffff > RSP: 0x00002000001fd508 > DUMMY_HANDLER for math_fault aborting. > This analysis is interesting - it shows we are trying to print invalid floating-point numbers, which is definitely a bug. However, I have no idea if without enabling the invalid-operand exception the result would have been an endless loop in printf(). If the answer is no, then you found another bug, but not the same bug we are chasing :-( If you can find a way to print the floating-point number which caused this crash (its representation as an integer, of course, not with %f :-)), you can try calling printf() with the same number and seeing whether it goes into an endless loop. If it does, then we really have our culprit, or even two: First, printf() shouldn't crash on an invalid number. Second, why we get an invalid number in the first place. My theory is this: > > 1. Httpserver tries to print thread list using this code: > 118 os_threads.set_handler([](const_req req) { > 119 using namespace std::chrono; > 120 httpserver::json::Threads threads; > 121 threads.time_ms = duration_cast<milliseconds> > 122 (osv::clock::wall::now().time_since_epoch()).count(); > 123 httpserver::json::Thread thread; > 124 sched::with_all_threads([&](sched::thread &t) { > 125 thread.id = t.id(); > 126 thread.status = t.get_status(); > 127 auto tcpu = t.tcpu(); > 128 thread.cpu = tcpu ? tcpu->id : -1; > 129 thread.cpu_ms = > duration_cast<milliseconds>(t.thread_clock()).count(); > 130 thread.switches = t.stat_switches.get(); > 131 thread.migrations = t.stat_migrations.get(); > 132 thread.preemptions = t.stat_preemptions.get(); > 133 thread.name = t.name(); > 134 thread.priority = t.priority(); > 135 thread.stack_size = t.get_stack_info().size; > 136 thread.status = t.get_status(); > 137 threads.list.push(thread); > 138 }); > 139 return threads; > 140 }); > This code uses sched::with_all_threads() to iterate over the list of > threads which uses mutex guarding against adding new threads and destroying > existing one. However I think this does not guard any of the values grabbed > to populate httpserver::json::Thread from* being concurrently modified by > OSv scheduler* (am I wrong?). > *The priority field is float which when copied in not-thread-safe way ends > up being junk which does not represent any valid float value (invalid > operand).* > It is true that there is no special protection for the priority field, but there are two reasons why I doubt that this is the problem (although I can't be sure...): 1. priority is a 32-bit "float" (not a 64-bit double). An aligned 32-bit number should be atomic, I believe. 2. In any case, we don't commonly change the priority field of threads after they are created. I'm not sure I have even one example of this. It's possible, but I doubt we actually do it? > 2. The code downstream to serialize JSON calls vfprintf which at line 165 > tries to extract the single element from va_list. This probably ends up > using FPU which rejects corrupted priority float value. Because normally > invalid operand exception is disabled vfprintf happily continues and gets > to that famous loop which probably keeps raising same exception which gets > ignored until we reach the beginning of the unmapped page and boom !!! > > How do you like my theory? > I like it, but it has some big holes that can be tested (see above). > I think this theory has its shortcomings - I think it explains open issues > #536 and #1010 (and possibly to soon closed #490) but I think it fails to > explain the newest one I created - > https://github.com/cloudius-systems/osv/issues/1018 (there is no > httpserver running). Maybe the latter one is due to some issues with > missing initialization of FPU control word (but I doubt it). > > Waldek > > On Thursday, November 29, 2018 at 6:19:45 PM UTC-5, Waldek Kozaczuk wrote: >> >> Unless I made a mistake in the code that verifies fpu state corruption >> which showed no corruption I think it might be worth to ponder possibility >> of different nature of the root cause of this and other issues where >> fmt_fp() shows up in the stack trace. >> >> What if the problem is actually caused by wrong/incompatible compiler >> options that drive what machine code is generated to handle floating point >> operation. For example I noticed that when I configured musl independently >> on my machine I saw it add following options related to floating point math: >> >> CFLAGS_C99FSE = -std=c99 -nostdinc -ffreestanding -*fexcess-precision=* >> *standard* *-frounding-**math* -Wa,--noexecstack >> >> I have not enough experience to make any conclusive observations but two >> of those -fexcess-precision and -frounding-math do affet code using float, >> double, long double etc. I did not see those in main OSv makefile so I >> wonder if musl compiled code is wrong and misbehaves. >> >> I think FPU has control word with RC fields that control rounding ( >> http://www.website.masmforum.com/tutorials/fptute/fpuchap1.htm) what if >> our code setups FPU rounding that is incompatible with how we compile musl. >> Just wild theory. >> >> Adding musl math library documentation - >> https://wiki.musl-libc.org/mathematical-library.html- there are >> paragraphs that talk about rounding and precision. >> >> On Thursday, November 29, 2018 at 2:33:26 PM UTC-5, Waldek Kozaczuk wrote: >>> >>> I wonder if per this - https://software.intel.com/en-us/node/523375 - >>> we are enabling correct bits for xsave if caste is what being used here. >>> >>> “The second operand is a save/restore mask specifying the saved/restored >>> extended states. The value of the mask is ANDed with >>> XFEATURE_ENABLED_MASK(XCR0). A particular extended state is saved/restored >>> only if the corresponding bit of both save/restore mask and >>> XFEATURE_ENABLED_MASK is set to '1'.” >> >> -- > You received this message because you are subscribed to the Google Groups > "OSv Development" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
