Re: Filesystem functions executed in thread are slow

Robert Goulet Tue, 17 Nov 2015 11:19:56 -0800

Yes thanks JJ, that should be sufficient for now. Good job!

On Tuesday, November 17, 2015 at 9:54:46 AM UTC-5, jj wrote:
>
> The filesystem is already thread-safe, meaning that multiple threads can
> safely access files in the filesystem.
>
> When I say "implement a fully multithreaded filesystem", I mean removing
> the need to proxy the operations via the main thread. The reason this
> proxying is done is that the whole filesystem data structure (MEMFS) is
> thread-local to the main thread and can't be shared between threads.
> Implementing a filesystem that multiple threads can access without
> proxying requires moving the filesystem into the asm.js heap and
> implementing it in C/C++ code (shareable via SharedArrayBuffer) instead of
> JavaScript code (unshareable, since JS objects are always thread-local).
>
> Also, in this case it was not postMessage() that was causing the slowdown:
> the proposed test case does not use postMessage() to pass the proxying
> task; it uses the SharedArrayBuffer data structure. postMessage() is used
> when the main thread is idle, to wake it up to perform the proxied call.
> In this case the main thread was not idle; it was sleeping in
> pthread_join(), where the sleep was performed in 100 msec slices. I posted
> the pull request https://github.com/kripken/emscripten/pull/3923 to
> dramatically reduce the slice in the main thread to 1 msec, which improved
> the test case from 3232 msecs to 45 msecs. I think this is enough to call
> this issue resolved for the interim, until we have the ability to rewrite
> the filesystem in C/C++.
>
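A quick arithmetic check of the slice numbers above, as a rough sketch. It assumes the measurements refer to the test case posted earlier in this thread (a 1048576-byte file read in 32768-byte chunks) and that each proxied read waits about one full sleep slice. The function names are illustrative only and are not part of Emscripten.

```c
#include <assert.h>

/* Number of fread() calls needed to read file_bytes in pieces of
   chunk_bytes, rounded up. */
static long read_count(long file_bytes, long chunk_bytes) {
    assert(chunk_bytes > 0);
    return (file_bytes + chunk_bytes - 1) / chunk_bytes;
}

/* Rough model (an assumption): every proxied read waits for about one
   sleep slice of the main thread before it is serviced. */
static long estimated_wait_ms(long reads, long slice_ms) {
    return reads * slice_ms;
}
```

Under these assumptions, read_count(1048576, 32768) is 32 reads. With a 100 msec slice the estimate is 3200 msecs, close to the 3232 msecs measured. With a 1 msec slice it is 32 msecs, which is in the same range as the 45 msecs measured after the change.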
> 2015-11-17 16:29 GMT+02:00 Robert Goulet <[email protected]>:
>
>> Most likely it would be better to just go straight to making the 
>> filesystem thread safe. This is something we would really like to use in 
>> our game engine. On the other hand, I don't know how much else there is
>> that would still be proxied, but I would bet that getting rid of the
>> postMessage completely, as Alon is suggesting, would improve performance
>> for those other cases.
>>
>> Is there any way I can help with making the filesystem thread safe?
>>
>> On Tuesday, November 17, 2015 at 5:51:27 AM UTC-5, jj wrote:
>>>
>>> Thanks for the test case Robert, and sorry for the long delay in 
>>> responding. I noted this down as
>>> https://github.com/kripken/emscripten/issues/3922. When I was writing
>>> the proxying, I was concerned that it was going to be slow, and I've seen
>>> the filesystem slowdown in other projects as well, but I never really
>>> believed it could be this much!
>>>
>>> In my mind the end goal has been to remove the proxying in filesystem 
>>> related paths altogether, and implement a fully multithreaded filesystem, 
>>> which would remove the need to proxy. It sounds like we need to finally
>>> start moving in that direction, unless there's something short-term that
>>> we could do here.
>>>
>>> 2015-11-06 0:52 GMT+02:00 Alon Zakai <[email protected]>:
>>>
>>>> I would definitely expect a postMessage to have some noticeable lag -  
>>>> but I would expect a few ms, not the large amounts we see here. Still, 
>>>> perhaps we can fix this by removing the postMessage entirely. In theory 
>>>> maybe we could wait on a mutex/futex on the main thread, if we are not 
>>>> doing anything else? We could only do it for the duration of a single 
>>>> frame, I suppose, but then at least if called during that time, we would 
>>>> instantaneously respond.
>>>>
>>>> On Thu, Nov 5, 2015 at 8:47 AM, Robert Goulet <[email protected]> 
>>>> wrote:
>>>>
>>>>> OK, I figured out a way to profile both the
>>>>> emscripten_sync_run_in_main_thread and
>>>>> emscripten_main_thread_process_queued_calls functions, and I found that
>>>>> 99.9% of the time is spent waiting on the futex at the end of
>>>>> emscripten_sync_run_in_main_thread. The mutex lock/unlock itself takes
>>>>> 0.01ms on average, so it's negligible. Processing the call itself takes
>>>>> 0.65ms on average.
>>>>>
>>>>> @Floh Looks like all the time is spent on the context switch between
>>>>> the thread and the main thread. Perhaps what happens is exactly what
>>>>> you suggested: the main thread loop isn't called as often as we think,
>>>>> leaving a big gap between the time we postMessage and the time it's
>>>>> received. Or perhaps the message mechanics of WebWorkers are just slow?
>>>>>
>>>>>
>>>>> On Thursday, November 5, 2015 at 10:46:30 AM UTC-5, Robert Goulet 
>>>>> wrote:
>>>>>>
>>>>>> Seems like we are dealing with many things here:
>>>>>>
>>>>>>    1. Queue's mutex lock to add the call to the queue 
>>>>>>    (emscripten_sync_run_in_main_thread)
>>>>>>    2. Then we call postMessage (not sure how it works, could there
>>>>>>    be some blocking here?) to notify that we have something in the queue
>>>>>>    3. Then queue's mutex unlock
>>>>>>    4. Then we block waiting on a futex for the operation to complete
>>>>>>    5. At some point, the main thread wakes up and processes the calls
>>>>>>    in the queue (emscripten_main_thread_process_queued_calls)
>>>>>>    6. Queue's mutex lock to read and empty the queue
>>>>>>    7. Queue's mutex unlock
>>>>>>
>>>>>> I see one area that could be improved,
>>>>>> emscripten_main_thread_process_queued_calls: instead of keeping the
>>>>>> mutex locked while we process all queued calls, we could copy them
>>>>>> into a stack variable and unlock the mutex immediately, allowing other
>>>>>> threads to add to the queue while they are being processed. I'm not
>>>>>> sure this would improve the situation here, though, since we are
>>>>>> waiting on the futex for the fread operation to complete anyway. I'll
>>>>>> try to profile the actual locking mechanics to see whether it's the
>>>>>> mutex or the context switch using the message that is slow.
>>>>>>
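The "copy to a stack variable, then unlock" idea from the paragraph above can be sketched in portable C as follows. This is a minimal illustration under stated assumptions, not Emscripten's actual implementation: the call_queue type, the fixed MAX_CALLS capacity, and all function names here are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_CALLS 64  /* hypothetical fixed capacity, for simplicity */

typedef void (*call_fn)(void *arg);
typedef struct { call_fn fn; void *arg; } queued_call;

typedef struct {
    pthread_mutex_t lock;
    queued_call calls[MAX_CALLS];
    size_t count;
} call_queue;

static void queue_init(call_queue *q) {
    pthread_mutex_init(&q->lock, NULL);
    q->count = 0;
}

/* Append a call; returns 0 on success, -1 if the queue is full. */
static int queue_push(call_queue *q, call_fn fn, void *arg) {
    int rc = -1;
    pthread_mutex_lock(&q->lock);
    if (q->count < MAX_CALLS) {
        q->calls[q->count].fn = fn;
        q->calls[q->count].arg = arg;
        q->count++;
        rc = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Copy the pending calls to a local array while holding the lock, release
   the lock, and only then run them. Returns the number of calls run. */
static size_t queue_process(call_queue *q) {
    queued_call local[MAX_CALLS];
    size_t n, i;

    pthread_mutex_lock(&q->lock);
    n = q->count;
    for (i = 0; i < n; i++) local[i] = q->calls[i];
    q->count = 0;
    pthread_mutex_unlock(&q->lock);

    for (i = 0; i < n; i++) local[i].fn(local[i].arg);
    return n;
}

/* Example call: increments the int that arg points to. */
static void increment(void *arg) { ++*(int *)arg; }

/* Example call that enqueues follow-up work while the queue is being
   processed; this is safe because the lock is no longer held. */
typedef struct { call_queue *q; int *counter; } requeue_arg;
static void enqueue_increment(void *arg) {
    requeue_arg *r = (requeue_arg *)arg;
    queue_push(r->q, increment, r->counter);
}
```

Because the lock is released before the calls run, other threads (or the calls themselves, as in enqueue_increment) can add work during processing; that work is picked up by the next queue_process() call. As the paragraph above notes, this shortens lock hold time but does not by itself shorten the wait for a single proxied read.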
>>>>>> On Thursday, November 5, 2015 at 10:08:08 AM UTC-5, Floh wrote:
>>>>>>>
>>>>>>> Sounds like you need some sort of read-ahead, where a small read
>>>>>>> would actually read a larger portion of the file into a
>>>>>>> (size-configurable) buffer on the worker-thread side with a single
>>>>>>> round-trip to the main thread. Subsequent small reads wouldn't require
>>>>>>> a main-thread round-trip until the end of the buffer is reached, and
>>>>>>> the next read would fill up the buffer again...
>>>>>>>
>>>>>>> Disclaimer: I haven't looked at the code. Even if every small read
>>>>>>> does a complete, blocking thread round-trip, this doesn't explain the
>>>>>>> high blocking times you are seeing. I guess there is another problem
>>>>>>> that gives a single read too high a blocking time (maybe it can only
>>>>>>> do at most 1 read per frame, or something similar).
>>>>>>>
>>>>>>> Cheers,
>>>>>>> -Floh.
>>>>>>>
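The read-ahead idea Floh describes could look roughly like the sketch below. It is a minimal illustration in portable C, assuming that each underlying fread() call is what costs one proxied round-trip to the main thread; the ra_file type, RA_CAPACITY, and the function names are hypothetical and not part of Emscripten.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RA_CAPACITY 65536  /* read-ahead buffer size; tune per workload */

typedef struct {
    FILE *fp;
    unsigned char buf[RA_CAPACITY];
    size_t pos;      /* next unread byte in buf */
    size_t len;      /* number of valid bytes in buf */
    size_t refills;  /* how many times the underlying fread() was called */
} ra_file;

static void ra_init(ra_file *f, FILE *fp) {
    assert(fp != NULL);
    f->fp = fp;
    f->pos = 0;
    f->len = 0;
    f->refills = 0;
}

/* Read up to n bytes into dst. Returns the number of bytes copied, which is
   less than n only at end of file or on a read error. */
static size_t ra_read(ra_file *f, void *dst, size_t n) {
    unsigned char *out = (unsigned char *)dst;
    size_t done = 0;

    while (done < n) {
        if (f->pos == f->len) {
            /* Buffer exhausted: refill it with one large read. */
            f->len = fread(f->buf, 1, RA_CAPACITY, f->fp);
            f->pos = 0;
            f->refills++;
            if (f->len == 0) break;  /* end of file or error */
        }
        size_t avail = f->len - f->pos;
        size_t take = (n - done < avail) ? n - done : avail;
        memcpy(out + done, f->buf + f->pos, take);
        f->pos += take;
        done += take;
    }
    return done;
}
```

With this wrapper, many small ra_read() calls are served from the local buffer and only the refills reach the underlying file. The refills counter is there so the number of underlying reads can be compared against the number of requests.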
>>>>>>> Am Donnerstag, 5. November 2015 15:09:04 UTC+1 schrieb Robert Goulet:
>>>>>>>>
>>>>>>>> Where is the code that deals with this locking mechanics? I'd like 
>>>>>>>> to take a look at it.
>>>>>>>>
>>>>>>>> Also, it doesn't matter much for us how long the entire process
>>>>>>>> takes. It's the time of each read that matters. If we can improve
>>>>>>>> that, it would be great!
>>>>>>>>
>>>>>>>> On Wednesday, November 4, 2015 at 5:22:32 PM UTC-5, Alon Zakai 
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Thanks for the testcase. I see the same results.
>>>>>>>>>
>>>>>>>>> It looks like reducing the number of reads helps a lot. The
>>>>>>>>> overhead is affected mostly by the number of reads, which could make
>>>>>>>>> sense if the main thread is busy (since it's the main browser
>>>>>>>>> thread, it could be busy doing anything from rendering to doing some
>>>>>>>>> work for another tab) and the worker needs to wait on it. Also, we
>>>>>>>>> send a message to the main thread, so any general activity on the
>>>>>>>>> event queue could lead to the message being received later.
>>>>>>>>>
>>>>>>>>> It's also possible that the mutex and futex stuff we do for the 
>>>>>>>>> blocking call has overhead. Jukka, do we have a way to profile that?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 4, 2015 at 2:06 PM, Robert Goulet <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Here, I quickly wrote this small test case
>>>>>>>>>> <https://autodesk.box.com/s/h12u5woqkuwyzg4uk7hw6puck8jg8orh>
>>>>>>>>>> which reproduces the problem.
>>>>>>>>>>
>>>>>>>>>> I get the following output from it:
>>>>>>>>>>
>>>>>>>>>> Preallocating 1 workers for a pthread spawn pool.
>>>>>>>>>> Writing test file (1048576 bytes)...
>>>>>>>>>> Reading file from main thread...
>>>>>>>>>> Completed in 6.730000ms (~0.205156ms per read of 32768 bytes)
>>>>>>>>>> Reading file from another thread...
>>>>>>>>>> Completed in 3499.585000ms (~109.355312ms per read of 32768 bytes)
>>>>>>>>>> Done.
>>>>>>>>>>
>>>>>>>>>> Please let me know if there's anything we can do to fix this 
>>>>>>>>>> major difference between the two.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> On Wednesday, November 4, 2015 at 3:25:58 PM UTC-5, Robert Goulet 
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> That is for many small/medium reads. It seems the size of the
>>>>>>>>>>> read does not have any impact on performance, though.
>>>>>>>>>>>
>>>>>>>>>>> Essentially, the thread we create does the following:
>>>>>>>>>>>
>>>>>>>>>>> while (true) {
>>>>>>>>>>>     _queue_semaphore.wait();  // block until a request is queued
>>>>>>>>>>>     if (_exit_thread)
>>>>>>>>>>>         break;
>>>>>>>>>>>     process_request();
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> In that case, the process_request() function takes 200ms+ to
>>>>>>>>>>> execute (profiled within the process_request() function itself, so
>>>>>>>>>>> that it does not include the locking mechanics overhead).
>>>>>>>>>>>
>>>>>>>>>>> If we just call process_request() right away upon adding a request
>>>>>>>>>>> instead of inserting the request in the thread queue (which
>>>>>>>>>>> essentially just bypasses the thread completely), the
>>>>>>>>>>> process_request() function takes <0.2ms to execute. The only thing
>>>>>>>>>>> this function does is a switch case between fopen(), fread() and
>>>>>>>>>>> fclose(). I've narrowed it down to these filesystem functions,
>>>>>>>>>>> which are taking much longer to return.
>>>>>>>>>>>
>>>>>>>>>>> The main thread is waiting on the thread queue to complete
>>>>>>>>>>> before returning, so I don't see why it would block the thread
>>>>>>>>>>> from doing its work?
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, November 4, 2015 at 12:53:21 PM UTC-5, Alon Zakai 
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> The filesystem itself resides in JS, which can only be accessed
>>>>>>>>>>>> from the main thread. Workers therefore need to send messages to
>>>>>>>>>>>> communicate with it. However, 200ms seems ridiculously high. Is
>>>>>>>>>>>> that for a single read()? Or many reads of small amounts? If you
>>>>>>>>>>>> can make a small standalone testcase showing the issue, that would
>>>>>>>>>>>> be useful for benchmarking.
>>>>>>>>>>>>
>>>>>>>>>>>> A possibility is that the blocking is the issue, and the main 
>>>>>>>>>>>> thread is busy with something else.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 4, 2015 at 7:57 AM, Robert Goulet <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are using the new pthread support in Emscripten, and one
>>>>>>>>>>>>> thing we noticed is how much slower filesystem functions are when
>>>>>>>>>>>>> executed in a thread. We saw this in the documentation:
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Currently several of the functions in the C runtime, such as
>>>>>>>>>>>>>> filesystem functions like fopen(), fread(), printf(), fprintf()
>>>>>>>>>>>>>> etc. are not multithreaded, but instead their execution is
>>>>>>>>>>>>>> proxied over to the main application thread.*
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm just trying to understand what we are dealing with. At
>>>>>>>>>>>>> this point I am guessing this is the reason why it is much slower
>>>>>>>>>>>>> to read a file using fread in a thread. We are seeing 1000x
>>>>>>>>>>>>> slowdowns compared to running in the main thread directly. For
>>>>>>>>>>>>> example, running in the main thread, a read request can complete
>>>>>>>>>>>>> in 0.2ms, while in a thread it takes 200ms. Most likely that's the
>>>>>>>>>>>>> overhead of waiting on the main thread to process proxied
>>>>>>>>>>>>> requests?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are there any technical blockers preventing filesystem functions
>>>>>>>>>>>>> from being multithreaded, so that they are no longer put in the
>>>>>>>>>>>>> main thread proxy queue?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>>>> Google Groups "emscripten-discuss" group.
>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>>>>>> it, send an email to 
>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>>
>>>>>>>>>>>>
