Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Waldek Kozaczuk Sat, 01 Dec 2018 08:27:02 -0800

Here we go again ... I was thinking more about it. 

First of regarding potential corruption of FPU state during involuntary 
interrupt task switch, unless my code that verifies it is not the case using 
the checksum logic (see my previous email with the patch), there could be 
another possibility - we are telling xsave/xstor to save/restore to/from wrong 
location in memory. Meaning that whatever got saved stays not corrupted but 
somehow we are restoring wrong data for whatever reason. Probably very 
unlikely.


Secondly I think that part of our theory, that this famous loop in vfprintf 
loops infinitely until tries to access unmapped page of memory, is very likely 
correct. But it might happening for different reason. So far we thought that 
simply the number is NaN or corrupt that makes FPU Instruction calculate NAN 
forever. But what if for whatever reason we are experiencing FPU stack overflow 
- all registers are taken and any instruction trying to load data into it keeps 
failing because it is full. I believe this exception is disabled but I can 
enable it to see if I am correct and it happens somewhere in the code. Now what 
if after voluntary or involuntary thread switch the FPU stack is full and the 
thread switched to simply does not know it. Whose responsibility is in this 
case to make sure that FPU is in clean state (empty)?

Maybe the involuntary switch has/should have two sub-cases:
1) the thread we are switching to before was also involuntarily switched and we 
need to restore its FPU state from interrupt stack which we do 
2) the thread we are switching to voluntarily yielded and ... I think FPU 
should be in somewhat clean state, at least be empty unless something left it 
full before, right?

My thinking might be all wrong. But just food for thought. 

Waldek

Sent from my iPhone

> On Dec 1, 2018, at 00:46, Waldek Kozaczuk <[email protected]> wrote:
> 
> My theory has fallen apart: I modified httpserver like this to pass constant 
> float value of 0.5 and still getting same crash:
> 
> index 358a86c6..126ab5f4 100644
> --- a/modules/httpserver-api/api/os.cc
> +++ b/modules/httpserver-api/api/os.cc
> @@ -131,7 +131,7 @@ void init(routes& routes)
>              thread.migrations = t.stat_migrations.get();
>              thread.preemptions = t.stat_preemptions.get();
>              thread.name = t.name();
> -            thread.priority = t.priority();
> +            thread.priority = 1.0;//t.priority();
>              thread.stack_size = t.get_stack_info().size;
>              thread.status = t.get_status();
>              threads.list.push(thread);
> 
>  I even created separate test program that keeps sprintf'ing both NAN and 
> regular float numbers using 5 threads to replace ffmpeg and I could not 
> replicate same program running exact same scenario with httpserver.
> 
> As I wrote in other thread I added checksum logic to detect FPU state 
> corruption that unfortunately does not detect anything. But is this logic 
> actually correct?
> 
> Ediff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.cc
> index bb96af3d..34bf6081 100644
> --- a/arch/x64/arch-cpu.cc
> +++ b/arch/x64/arch-cpu.cc
> @@ -42,6 +42,14 @@ exception_guard::~exception_guard()
>      sched::cpu::current()->arch.exit_exception();
>  }
>  
> +static inline u64 calculate_fpu_state_checksum(processor::fpu_state *s) {
> +    u64 val = 0;
> +    char* data = s->legacy;
> +    for (u64 i = 0; i < sizeof(processor::fpu_state) - sizeof(u64); i++)
> +       val += data[i];
> +    return val;
> +}
> +
>  extern "C"
>  [[gnu::target("no-sse")]]
>  void fpu_state_init_xsave(processor::fpu_state *s) {
> @@ -61,21 +69,25 @@ void fpu_state_init_fxsave(processor::fpu_state *s) {
>  extern "C"
>  void fpu_state_save_xsave(processor::fpu_state *s) {
>      processor::xsave(s, -1);
> +    s->checksum = calculate_fpu_state_checksum(s);
>  }
>  
>  extern "C"
>  void fpu_state_save_fxsave(processor::fpu_state *s) {
>      processor::fxsave(s);
> +    s->checksum = calculate_fpu_state_checksum(s);
>  }
>  
>  extern "C"
>  void fpu_state_restore_xsave(processor::fpu_state *s) {
>      processor::xrstor(s, -1);
> +    assert(s->checksum == calculate_fpu_state_checksum(s));
>  }
>  
>  extern "C"
>  void fpu_state_restore_fxsave(processor::fpu_state *s) {
>      processor::fxrstor(s);
> +    assert(s->checksum == calculate_fpu_state_checksum(s));
>  }
>  
>  extern "C"
> diff --git a/arch/x64/processor.hh b/arch/x64/processor.hh
> index 6b1fbf66..f699d5b8 100644
> --- a/arch/x64/processor.hh
> +++ b/arch/x64/processor.hh
> @@ -343,6 +343,7 @@ struct fpu_state {
>      char xsavehdr[24];
>      char reserved[40];
>      char ymm[256];
> +    u64 checksum;
>  } __attribute__((packed));
>  
>  inline void fxsave(fpu_state* s)
> 
>> On Friday, November 30, 2018 at 7:43:31 PM UTC-5, Nadav Har'El wrote:
>>> On Thu, Nov 29, 2018 at 9:33 PM Waldek Kozaczuk <[email protected]> wrote:
>>> I wonder if per this - https://software.intel.com/en-us/node/523375 - we 
>>> are enabling correct bits for xsave if caste is what being used here. 
>>> 
>>> “The second operand is a save/restore mask specifying the saved/restored 
>>> extended states. The value of the mask is ANDed with 
>>> XFEATURE_ENABLED_MASK(XCR0). A particular extended state is saved/restored 
>>> only if the corresponding bit of both save/restore mask and 
>>> XFEATURE_ENABLED_MASK is set to '1'.”
>> 
>> Interesting question. In arch/x64/arch-cpu.hh, arch_cpu::init_on_cpu(), you 
>> can see:
>> 
>>     if (features().xsave) {
>>         auto bits = xcr0_x87 | xcr0_sse;
>>         if (features().avx) {
>>             bits |= xcr0_avx;
>>         }
>>         write_xcr(xcr0, bits);
>>     }
>> 
>> So we're turning on the "x87", "sse", and "avx" bits. Are there any other 
>> bits we need to set (or at least, not zero with this overwrite-all-bits 
>> code)?
>> 
> 
> -- 
> You received this message because you are subscribed to a topic in the Google 
> Groups "OSv Development" group.
> To unsubscribe from this topic, visit 
> https://groups.google.com/d/topic/osv-dev/ZAjvi_vEhnA/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to 
> [email protected].
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

Reply via email to