Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Nadav Har'El Fri, 30 Nov 2018 16:34:14 -0800

On Fri, Nov 30, 2018 at 11:34 PM Waldek Kozaczuk <[email protected]>
wrote:


> I think I have found the culprit!
>

If you remember, I suggested in the past that if the number being printed
had been a NaN, the loop in question might have gone on forever. However,
IIRC the code checks for NaN and infinity before entering the loop. This is
why I suggested maybe an FPU corruption might have changed some good number
into a NaN.

Now you're suggesting, maybe the original number passed to printf() is
somehow invalid, and the printf() code doesn't recognize it as special and
uses it in the loop causing the loop to never end. You should be able to
*directly* test this hypotesis - pick an "invalid" number, pass it to
printf(), and see if runs into the same loop (and crash) we are chasing.
But what "invalid" number to pick? I don't know, I'm not an expert enough
in the floating point standard. I'd start with an sNaN (signalling NaN),
and if that doesn't help, take the same floating point number which crashed
the http example.


> Related to what I am pointing out here -
> https://groups.google.com/forum/#!topic/osv-dev/-7vIPwOJsyg - I thought
> about enabling some FPU exceptions which are disabled by default. Once I
> changed 0x037f to 0x037e which enables invalid operand exception and ran
> my test and got this exception:
>
> [registers] fps=0.0 q=-0.0 size=N/A time=00:00:14.00 bitrate=N/A dup=0
> drop=273 speed=  28x
> RIP: 0x000000000044fb28 <???+4520744>
> RFL: 0x0000000000010206  CS:  0x0000000000000008  SS:  0x0000000000000010
> RAX: 0x0000000000000040  RBX: 0x00001000000faa22  RCX: 0x00002000001fd960
>  RDX: 0x00002000001fd618
> RSI: 0x000000000000000a  RDI: 0x00002000001fd580  RBP: 0x00002000001fd5f0
>  R8:  0x0000000000000000
> R9:  0x0000000000000000  R10: 0x0000000000000000  R11: 0x00000000ffffffff
>  R12: 0x0000000000000000
> R13: 0x0000000000000000  R14: 0x0000000000000000  R15: 0x00000000ffffffff
>  RSP: 0x00002000001fd508
> DUMMY_HANDLER for math_fault aborting.
>

This analysis is interesting - it shows we are trying to print invalid
floating-point numbers, which is definitely a bug. However, I have no idea
if without enabling the invalid-operand exception the result would have
been an endless loop in printf(). If the answer is no, then you found
another bug, but not the same bug we are chasing :-(

If you can find a way to print the floating-point number which caused this
crash (its representation as an integer, of course, not with %f :-)), you
can try calling printf() with the same number and seeing whether it goes
into an endless loop. If it does, then we really have our culprit, or even
two: First, printf() shouldn't crash on an invalid number. Second, why we
get an invalid number in the first place.

My theory is this:
>
> 1. Httpserver tries to print thread list using this code:
> 118     os_threads.set_handler([](const_req req) {
> 119         using namespace std::chrono;
> 120         httpserver::json::Threads threads;
> 121         threads.time_ms = duration_cast<milliseconds>
> 122             (osv::clock::wall::now().time_since_epoch()).count();
> 123         httpserver::json::Thread thread;
> 124         sched::with_all_threads([&](sched::thread &t) {
> 125             thread.id = t.id();
> 126             thread.status = t.get_status();
> 127             auto tcpu = t.tcpu();
> 128             thread.cpu = tcpu ? tcpu->id : -1;
> 129             thread.cpu_ms =
> duration_cast<milliseconds>(t.thread_clock()).count();
> 130             thread.switches = t.stat_switches.get();
> 131             thread.migrations = t.stat_migrations.get();
> 132             thread.preemptions = t.stat_preemptions.get();
> 133             thread.name = t.name();
> 134             thread.priority = t.priority();
> 135             thread.stack_size = t.get_stack_info().size;
> 136             thread.status = t.get_status();
> 137             threads.list.push(thread);
> 138         });
> 139         return threads;
> 140     });
> This code uses sched::with_all_threads() to iterate over the list of
> threads which uses mutex guarding against adding new threads and destroying
> existing one. However I think this does not guard any of the values grabbed
> to populate httpserver::json::Thread from* being concurrently modified by
> OSv scheduler* (am I wrong?).
> *The priority field is float which when copied in not-thread-safe way ends
> up being junk which does not represent any valid float value (invalid
> operand).*
>

It is true that there is no special protection for the priority field, but
there are two reasons why I doubt that this is the problem (although I
can't be sure...):
1. priority is a 32-bit "float" (not a 64-bit double). An aligned 32-bit
number should be atomic, I believe.
2. In any case, we don't commonly change the priority field of threads
after they are created. I'm not sure I have even one example of this. It's
possible, but I doubt we actually do it?


> 2. The code downstream to serialize JSON calls vfprintf which at line 165
> tries to extract the single element from va_list. This probably ends up
> using FPU which rejects corrupted priority float value. Because normally
> invalid operand exception is disabled vfprintf happily continues and gets
> to that famous loop which probably keeps raising same exception which gets
> ignored until we reach the beginning of the unmapped page and boom !!!
>
> How do you like my theory?
>

I like it, but it has some big holes that can be tested (see above).


> I think this theory has its shortcomings - I think it explains open issues
> #536 and #1010 (and possibly to soon closed #490) but I think it fails to
> explain the newest one I created -
> https://github.com/cloudius-systems/osv/issues/1018 (there is no
> httpserver running). Maybe the latter one is due to some issues with
> missing initialization of FPU control word (but I doubt it).
>
> Waldek
>
> On Thursday, November 29, 2018 at 6:19:45 PM UTC-5, Waldek Kozaczuk wrote:
>>
>> Unless I made a mistake in the code that verifies fpu state corruption
>> which showed no corruption I think it might be worth to ponder possibility
>> of different nature of the root cause of this and other issues where
>> fmt_fp() shows up in the stack trace.
>>
>> What if the problem is actually caused by wrong/incompatible compiler
>> options that drive what machine code is generated to handle floating point
>> operation. For example I noticed that when I configured musl independently
>> on my machine I saw it add following options related to floating point math:
>>
>> CFLAGS_C99FSE = -std=c99 -nostdinc -ffreestanding -*fexcess-precision=*
>> *standard* *-frounding-**math* -Wa,--noexecstack
>>
>> I have not enough experience to make any conclusive observations but two
>> of those -fexcess-precision and -frounding-math do affet code using float,
>> double, long double etc. I did not see those in main OSv makefile so I
>> wonder if musl compiled code is wrong and misbehaves.
>>
>> I think FPU has control word with RC fields that control rounding (
>> http://www.website.masmforum.com/tutorials/fptute/fpuchap1.htm) what if
>> our code setups FPU rounding that is incompatible with how we compile musl.
>> Just wild theory.
>>
>> Adding musl math library documentation -
>> https://wiki.musl-libc.org/mathematical-library.html- there are
>> paragraphs that talk about rounding and precision.
>>
>> On Thursday, November 29, 2018 at 2:33:26 PM UTC-5, Waldek Kozaczuk wrote:
>>>
>>> I wonder if per this - https://software.intel.com/en-us/node/523375 -
>>> we are enabling correct bits for xsave if caste is what being used here.
>>>
>>> “The second operand is a save/restore mask specifying the saved/restored
>>> extended states. The value of the mask is ANDed with
>>> XFEATURE_ENABLED_MASK(XCR0). A particular extended state is saved/restored
>>> only if the corresponding bit of both save/restore mask and
>>> XFEATURE_ENABLED_MASK is set to '1'.”
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "OSv Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Reply via email to