I would definitely expect a postMessage to have some noticeable lag - but I would expect a few ms, not the large amounts we see here. Still, perhaps we can fix this by removing the postMessage entirely. In theory maybe we could wait on a mutex/futex on the main thread, if we are not doing anything else? We could only do it for the duration of a single frame, I suppose, but then at least if called during that time, we would instantaneously respond.
On Thu, Nov 5, 2015 at 8:47 AM, Robert Goulet <[email protected]> wrote: > Ok I figured out a way to profile both the > emscripten_sync_run_in_main_thread and > emscripten_main_thread_process_queued_calls function and I found that > 99.9% of the time is spent waiting on the futex and the end of > emscripten_sync_run_in_main_thread. The mutex lock/unlock themselves > takes on average 0.01ms so its negligible. Processing the call itself takes > on average 0.65ms. > > @Floh Looks like all the time is spent on the context switching between > the thread and the main thread. Perhaps what happens is exactly what you > suggested - main thread loop isn't called as fast as we think, leaving a > big gap between the time we postMessage and the time its received. Or > perhaps its just the message mechanics of WebWorkers that is just slow? > > > On Thursday, November 5, 2015 at 10:46:30 AM UTC-5, Robert Goulet wrote: >> >> Seems like we are dealing with many things here: >> >> 1. Queue's mutex lock to add the call to the queue >> (emscripten_sync_run_in_main_thread) >> 2. Then we call postMessage (not sure how it works, could there be >> some blocking here?) to notify we have something in queue >> 3. Then queue's mutex unlock >> 4. Then we block wait on a futex for the operation to complete >> 5. At some point, the main thread wakes up and process the calls in >> the queue (emscripten_main_thread_process_queued_calls) >> 6. Queue's mutex lock to read and empty the queue >> 7. Queue's mutex unlock >> >> I see one area that could be improved, >> emscripten_main_thread_process_queued_calls: Instead of keeping the mutex >> locked while we process all queue calls, we could just fetch them on a >> stack variable and unlock the mutex immediately, allowing other threads to >> add to the queue while they are being processed, but I'm not sure this >> would improve the situation here since anyway we are waiting on the futex >> to complete the fread operation. I'll try to profile the actual locking >> mechanic to see if its the mutex, or the context switching using the >> message that is slow. >> >> On Thursday, November 5, 2015 at 10:08:08 AM UTC-5, Floh wrote: >>> >>> Sounds like you need some sort of read-ahead, where a small read would >>> actually read a larger portion of the file into a (size-configurable) >>> buffer on the worker-thread side with a single round-trip to the >>> main-thread, and subsequent small reads wouldn't require a main-thread >>> round-trip until the end of the buffer is reached, and the next read would >>> fill up the buffer again... >>> >>> Disclaimer: I haven't looked at the code, even if every small read does >>> a complete, blocking thread-roundtrip, this doesn't explain the high >>> blocking times you are seeing. I guess there is another problem that a >>> single read has a too high blocking time (may be it can only do at most 1 >>> read per frame, or something similar). >>> >>> Cheers, >>> -Floh. >>> >>> Am Donnerstag, 5. November 2015 15:09:04 UTC+1 schrieb Robert Goulet: >>>> >>>> Where is the code that deals with this locking mechanics? I'd like to >>>> take a look at it. >>>> >>>> Also, it doesn't matter much for us how long the entire process takes. >>>> It's more each read time that matters. If we can improve that, would be >>>> great! >>>> >>>> On Wednesday, November 4, 2015 at 5:22:32 PM UTC-5, Alon Zakai wrote: >>>>> >>>>> Thanks for the testcase. I see the same results. >>>>> >>>>> It looks like reducing the number of reads helps a lot. The overhead >>>>> is affected mostly by number of reads. Which could make sense if the main >>>>> thread is busy (since it's the main browser thread, it could be busy doing >>>>> anything from rendering to doing some work for another tab) and the worker >>>>> needs to wait on it. Also, we send a message to the main thread, so any >>>>> general activity on the event queue could lead to the message being >>>>> received later. >>>>> >>>>> It's also possible that the mutex and futex stuff we do for the >>>>> blocking call has overhead. Jukka, do we have a way to profile that? >>>>> >>>>> >>>>> On Wed, Nov 4, 2015 at 2:06 PM, Robert Goulet <[email protected]> >>>>> wrote: >>>>> >>>>>> Here, I quickly wrote this small test case >>>>>> <https://autodesk.box.com/s/h12u5woqkuwyzg4uk7hw6puck8jg8orh> which >>>>>> reproduce the problem. >>>>>> >>>>>> I get the following output from it: >>>>>> >>>>>> Preallocating 1 workers for a pthread spawn pool. >>>>>> Writing test file (1048576 bytes)... >>>>>> Reading file from main thread... >>>>>> Completed in 6.730000ms (~0.205156ms per read of 32768 bytes) >>>>>> Reading file from another thread... >>>>>> Completed in 3499.585000ms (~109.355312ms per read of 32768 bytes) >>>>>> Done. >>>>>> >>>>>> Please let me know if there's anything we can do to fix this major >>>>>> difference between the two. >>>>>> >>>>>> Thanks! >>>>>> >>>>>> On Wednesday, November 4, 2015 at 3:25:58 PM UTC-5, Robert Goulet >>>>>> wrote: >>>>>>> >>>>>>> That is for many small/medium reads. It seems the size of the read >>>>>>> does not have any impact on performance thought. >>>>>>> >>>>>>> Essentially, the thread we create does the following: >>>>>>> >>>>>>> while (true) { >>>>>>> _queue_semaphore.wait(); >>>>>>> if (_exit_thread) >>>>>>> break; >>>>>>> process_request(); >>>>>>> } >>>>>>> >>>>>>> in that case, the process_request() function takes 200ms+ to execute >>>>>>> (profiled within the process_request() function itself, so that it does >>>>>>> not >>>>>>> include locking mechanics overhead). >>>>>>> >>>>>>> If we just call process_request() right away upon adding request >>>>>>> instead of inserting the request in the thread queue (which essentially >>>>>>> just bypass the thread completely), the process_request() function takes >>>>>>> <0.2ms to execute. The only thing this function does is a switch case >>>>>>> between fopen(), fread() and fclose(). I've narrowed it down to being >>>>>>> these >>>>>>> filesystem function who are taking a much longer time to return. >>>>>>> >>>>>>> The main thread is waiting on the thread queue to complete before >>>>>>> returning, so I don't see why it would block the thread from doing its >>>>>>> work? >>>>>>> >>>>>>> On Wednesday, November 4, 2015 at 12:53:21 PM UTC-5, Alon Zakai >>>>>>> wrote: >>>>>>>> >>>>>>>> The filesystem itself resides in JS, which can only be accessed >>>>>>>> from the main thread. Workers therefore need to send messages to >>>>>>>> communicate with it. However, 200ms seems ridiculously high - is that >>>>>>>> for a >>>>>>>> single read()? Or many small reads of small amounts? If you can make a >>>>>>>> small standalone testcase showing the issue, that would be useful for >>>>>>>> benchmarking. >>>>>>>> >>>>>>>> A possibility is that the blocking is the issue, and the main >>>>>>>> thread is busy with something else. >>>>>>>> >>>>>>>> On Wed, Nov 4, 2015 at 7:57 AM, Robert Goulet < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> we are using the new pthread support in Emscripten, and one thing >>>>>>>>> we noticed is how much slower filesystem functions are when executed >>>>>>>>> in a >>>>>>>>> thread. We saw this in the documentation: >>>>>>>>> >>>>>>>>> *Currently several of the functions in the C runtime, such as >>>>>>>>>> filesystem functions like fopen(), fread(), printf(), fprintf() etc. >>>>>>>>>> are >>>>>>>>>> not multithreaded, but instead their execution is proxied over to >>>>>>>>>> the main >>>>>>>>>> application thread.* >>>>>>>>> >>>>>>>>> >>>>>>>>> I'm just trying to understand what we are dealing with. At this >>>>>>>>> point I am guessing this is the reason why it is much slower to read >>>>>>>>> a file >>>>>>>>> using fread in a thread. We are seeing 1000x slowdowns compared to >>>>>>>>> running >>>>>>>>> in the main thread directly. For example, running in main thread, a >>>>>>>>> read >>>>>>>>> request can complete in 0.2ms, while in a thread is takes 200ms. Most >>>>>>>>> likely that's the overhead of waiting on the main thread to process >>>>>>>>> proxied >>>>>>>>> requests? >>>>>>>>> >>>>>>>>> Is there any technical blockers preventing filesystem functions to >>>>>>>>> be multithreaded so that they are no longer put in the main thread >>>>>>>>> proxy >>>>>>>>> queue? >>>>>>>>> >>>>>>>>> Thanks! >>>>>>>>> >>>>>>>>> -- >>>>>>>>> You received this message because you are subscribed to the Google >>>>>>>>> Groups "emscripten-discuss" group. >>>>>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>>>>> send an email to [email protected]. >>>>>>>>> For more options, visit https://groups.google.com/d/optout. >>>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>> You received this message because you are subscribed to the Google >>>>>> Groups "emscripten-discuss" group. >>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>> send an email to [email protected]. >>>>>> For more options, visit https://groups.google.com/d/optout. >>>>>> >>>>> >>>>> -- > You received this message because you are subscribed to the Google Groups > "emscripten-discuss" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "emscripten-discuss" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
