Re: Filesystem functions executed in thread are slow

Robert Goulet Tue, 17 Nov 2015 06:29:49 -0800

Most likely it would be better to just go straight to making the filesystem 
thread safe. This is something we would really like to use in our game 
engine. On the other end, I don't know how much else is there that would 
still be proxied but I would bet that getting rid of the postMessage 
completely like Alon is suggesting would most likely improve performance 
for these other cases.


Is there any way I can help with making the filesystem thread safe?

On Tuesday, November 17, 2015 at 5:51:27 AM UTC-5, jj wrote:
>
> Thanks for the test case Robert, and sorry for the long delay in 
> responding. Marked down https://github.com/kripken/emscripten/issues/3922 
> . When I was writing the proxying, I was concerned that it's going to be 
> slow, and I've seen the filesystem slowdown in other projects as well, but 
> never really believed it could be this much!
>
> In my mind the end goal has been to remove the proxying in filesystem 
> related paths altogether, and implement a fully multithreaded filesystem, 
> which would remove the need to proxy. Sounds like we need to finally start 
> moving to that direction, unless there's something short term that we could 
> do here.
>
> 2015-11-06 0:52 GMT+02:00 Alon Zakai <[email protected] <javascript:>>:
>
>> I would definitely expect a postMessage to have some noticeable lag -  
>> but I would expect a few ms, not the large amounts we see here. Still, 
>> perhaps we can fix this by removing the postMessage entirely. In theory 
>> maybe we could wait on a mutex/futex on the main thread, if we are not 
>> doing anything else? We could only do it for the duration of a single 
>> frame, I suppose, but then at least if called during that time, we would 
>> instantaneously respond.
>>
>> On Thu, Nov 5, 2015 at 8:47 AM, Robert Goulet <[email protected] 
>> <javascript:>> wrote:
>>
>>> Ok I figured out a way to profile both the 
>>> emscripten_sync_run_in_main_thread and 
>>> emscripten_main_thread_process_queued_calls function and I found that 
>>> 99.9% of the time is spent waiting on the futex and the end of 
>>> emscripten_sync_run_in_main_thread. The mutex lock/unlock themselves 
>>> takes on average 0.01ms so its negligible. Processing the call itself takes 
>>> on average 0.65ms.
>>>
>>> @Floh Looks like all the time is spent on the context switching between 
>>> the thread and the main thread. Perhaps what happens is exactly what you 
>>> suggested - main thread loop isn't called as fast as we think, leaving a 
>>> big gap between the time we postMessage and the time its received. Or 
>>> perhaps its just the message mechanics of WebWorkers that is just slow?
>>>
>>>
>>> On Thursday, November 5, 2015 at 10:46:30 AM UTC-5, Robert Goulet wrote:
>>>>
>>>> Seems like we are dealing with many things here:
>>>>
>>>>    1. Queue's mutex lock to add the call to the queue 
>>>>    (emscripten_sync_run_in_main_thread)
>>>>    2. Then we call postMessage (not sure how it works, could there be 
>>>>    some blocking here?) to notify we have something in queue
>>>>    3. Then queue's mutex unlock
>>>>    4. Then we block wait on a futex for the operation to complete
>>>>    5. At some point, the main thread wakes up and process the calls in 
>>>>    the queue (emscripten_main_thread_process_queued_calls)
>>>>    6. Queue's mutex lock to read and empty the queue
>>>>    7. Queue's mutex unlock
>>>>
>>>> I see one area that could be improved, 
>>>> emscripten_main_thread_process_queued_calls: Instead of keeping the mutex 
>>>> locked while we process all queue calls, we could just fetch them on a 
>>>> stack variable and unlock the mutex immediately, allowing other threads to 
>>>> add to the queue while they are being processed, but I'm not sure this 
>>>> would improve the situation here since anyway we are waiting on the futex 
>>>> to complete the fread operation. I'll try to profile the actual locking 
>>>> mechanic to see if its the mutex, or the context switching using the 
>>>> message that is slow.
>>>>
>>>> On Thursday, November 5, 2015 at 10:08:08 AM UTC-5, Floh wrote:
>>>>>
>>>>> Sounds like you need some sort of read-ahead, where a small read would 
>>>>> actually read a larger portion of the file into a (size-configurable) 
>>>>> buffer on the worker-thread side with a single round-trip to the 
>>>>> main-thread, and subsequent small reads wouldn't require a main-thread 
>>>>> round-trip until the end of the buffer is reached, and the next read 
>>>>> would 
>>>>> fill up the buffer again...
>>>>>
>>>>> Disclaimer: I haven't looked at the code, even if every small read 
>>>>> does a complete, blocking thread-roundtrip, this doesn't explain the high 
>>>>> blocking times you are seeing. I guess there is another problem that a 
>>>>> single read has a too high blocking time (may be it can only do at most 1 
>>>>> read per frame, or something similar).
>>>>>
>>>>> Cheers,
>>>>> -Floh.
>>>>>
>>>>> Am Donnerstag, 5. November 2015 15:09:04 UTC+1 schrieb Robert Goulet:
>>>>>>
>>>>>> Where is the code that deals with this locking mechanics? I'd like to 
>>>>>> take a look at it.
>>>>>>
>>>>>> Also, it doesn't matter much for us how long the entire process 
>>>>>> takes. It's more each read time that matters. If we can improve that, 
>>>>>> would 
>>>>>> be great!
>>>>>>
>>>>>> On Wednesday, November 4, 2015 at 5:22:32 PM UTC-5, Alon Zakai wrote:
>>>>>>>
>>>>>>> Thanks for the testcase. I see the same results.
>>>>>>>
>>>>>>> It looks like reducing the number of reads helps a lot. The overhead 
>>>>>>> is affected mostly by number of reads. Which could make sense if the 
>>>>>>> main 
>>>>>>> thread is busy (since it's the main browser thread, it could be busy 
>>>>>>> doing 
>>>>>>> anything from rendering to doing some work for another tab) and the 
>>>>>>> worker 
>>>>>>> needs to wait on it. Also, we send a message to the main thread, so any 
>>>>>>> general activity on the event queue could lead to the message being 
>>>>>>> received later.
>>>>>>>
>>>>>>> It's also possible that the mutex and futex stuff we do for the 
>>>>>>> blocking call has overhead. Jukka, do we have a way to profile that?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 4, 2015 at 2:06 PM, Robert Goulet <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Here, I quickly wrote this small test case 
>>>>>>>> <https://autodesk.box.com/s/h12u5woqkuwyzg4uk7hw6puck8jg8orh> 
>>>>>>>> which reproduce the problem.
>>>>>>>>
>>>>>>>> I get the following output from it:
>>>>>>>>
>>>>>>>> Preallocating 1 workers for a pthread spawn pool.
>>>>>>>> Writing test file (1048576 bytes)...
>>>>>>>> Reading file from main thread...
>>>>>>>> Completed in 6.730000ms (~0.205156ms per read of 32768 bytes)
>>>>>>>> Reading file from another thread...
>>>>>>>> Completed in 3499.585000ms (~109.355312ms per read of 32768 bytes)
>>>>>>>> Done.
>>>>>>>>
>>>>>>>> Please let me know if there's anything we can do to fix this major 
>>>>>>>> difference between the two.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> On Wednesday, November 4, 2015 at 3:25:58 PM UTC-5, Robert Goulet 
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> That is for many small/medium reads. It seems the size of the read 
>>>>>>>>> does not have any impact on performance thought.
>>>>>>>>>
>>>>>>>>> Essentially, the thread we create does the following:
>>>>>>>>>
>>>>>>>>> while (true) {
>>>>>>>>> _queue_semaphore.wait();
>>>>>>>>> if (_exit_thread)
>>>>>>>>> break;
>>>>>>>>> process_request();
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> in that case, the process_request() function takes 200ms+ to 
>>>>>>>>> execute (profiled within the process_request() function itself, so 
>>>>>>>>> that it 
>>>>>>>>> does not include locking mechanics overhead).
>>>>>>>>>
>>>>>>>>> If we just call process_request() right away upon adding request 
>>>>>>>>> instead of inserting the request in the thread queue (which 
>>>>>>>>> essentially 
>>>>>>>>> just bypass the thread completely), the process_request() function 
>>>>>>>>> takes 
>>>>>>>>> <0.2ms to execute. The only thing this function does is a switch case 
>>>>>>>>> between fopen(), fread() and fclose(). I've narrowed it down to being 
>>>>>>>>> these 
>>>>>>>>> filesystem function who are taking a much longer time to return.
>>>>>>>>>
>>>>>>>>> The main thread is waiting on the thread queue to complete before 
>>>>>>>>> returning, so I don't see why it would block the thread from doing 
>>>>>>>>> its work?
>>>>>>>>>
>>>>>>>>> On Wednesday, November 4, 2015 at 12:53:21 PM UTC-5, Alon Zakai 
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> The filesystem itself resides in JS, which can only be accessed 
>>>>>>>>>> from the main thread. Workers therefore need to send messages to 
>>>>>>>>>> communicate with it. However, 200ms seems ridiculously high - is 
>>>>>>>>>> that for a 
>>>>>>>>>> single read()? Or many small reads of small amounts? If you can make 
>>>>>>>>>> a 
>>>>>>>>>> small standalone testcase showing the issue, that would be useful 
>>>>>>>>>> for 
>>>>>>>>>> benchmarking.
>>>>>>>>>>
>>>>>>>>>> A possibility is that the blocking is the issue, and the main 
>>>>>>>>>> thread is busy with something else.
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 4, 2015 at 7:57 AM, Robert Goulet <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> we are using the new pthread support in Emscripten, and one 
>>>>>>>>>>> thing we noticed is how much slower filesystem functions are when 
>>>>>>>>>>> executed 
>>>>>>>>>>> in a thread. We saw this in the documentation:
>>>>>>>>>>>
>>>>>>>>>>> *Currently several of the functions in the C runtime, such as 
>>>>>>>>>>>> filesystem functions like fopen(), fread(), printf(), fprintf() 
>>>>>>>>>>>> etc. are 
>>>>>>>>>>>> not multithreaded, but instead their execution is proxied over to 
>>>>>>>>>>>> the main 
>>>>>>>>>>>> application thread.*
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm just trying to understand what we are dealing with. At this 
>>>>>>>>>>> point I am guessing this is the reason why it is much slower to 
>>>>>>>>>>> read a file 
>>>>>>>>>>> using fread in a thread. We are seeing 1000x slowdowns compared to 
>>>>>>>>>>> running 
>>>>>>>>>>> in the main thread directly. For example, running in main thread, a 
>>>>>>>>>>> read 
>>>>>>>>>>> request can complete in 0.2ms, while in a thread is takes 200ms. 
>>>>>>>>>>> Most 
>>>>>>>>>>> likely that's the overhead of waiting on the main thread to process 
>>>>>>>>>>> proxied 
>>>>>>>>>>> requests?
>>>>>>>>>>>
>>>>>>>>>>> Is there any technical blockers preventing filesystem functions 
>>>>>>>>>>> to be multithreaded so that they are no longer put in the main 
>>>>>>>>>>> thread proxy 
>>>>>>>>>>> queue?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>> Google Groups "emscripten-discuss" group.
>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>>>> it, send an email to 
>>>>>>>>>>> [email protected].
>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>> Groups "emscripten-discuss" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>> send an email to [email protected].
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>>>>
>>>>>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "emscripten-discuss" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected] 
>>> <javascript:>.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "emscripten-discuss" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"emscripten-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: Filesystem functions executed in thread are slow

Reply via email to