I would definitely expect a postMessage to have some noticeable lag -  but
I would expect a few ms, not the large amounts we see here. Still, perhaps
we can fix this by removing the postMessage entirely. In theory maybe we
could wait on a mutex/futex on the main thread, if we are not doing
anything else? We could only do it for the duration of a single frame, I
suppose, but then at least if called during that time, we would
instantaneously respond.

On Thu, Nov 5, 2015 at 8:47 AM, Robert Goulet <[email protected]>
wrote:

> Ok I figured out a way to profile both the
> emscripten_sync_run_in_main_thread and
> emscripten_main_thread_process_queued_calls function and I found that
> 99.9% of the time is spent waiting on the futex and the end of
> emscripten_sync_run_in_main_thread. The mutex lock/unlock themselves
> takes on average 0.01ms so its negligible. Processing the call itself takes
> on average 0.65ms.
>
> @Floh Looks like all the time is spent on the context switching between
> the thread and the main thread. Perhaps what happens is exactly what you
> suggested - main thread loop isn't called as fast as we think, leaving a
> big gap between the time we postMessage and the time its received. Or
> perhaps its just the message mechanics of WebWorkers that is just slow?
>
>
> On Thursday, November 5, 2015 at 10:46:30 AM UTC-5, Robert Goulet wrote:
>>
>> Seems like we are dealing with many things here:
>>
>>    1. Queue's mutex lock to add the call to the queue
>>    (emscripten_sync_run_in_main_thread)
>>    2. Then we call postMessage (not sure how it works, could there be
>>    some blocking here?) to notify we have something in queue
>>    3. Then queue's mutex unlock
>>    4. Then we block wait on a futex for the operation to complete
>>    5. At some point, the main thread wakes up and process the calls in
>>    the queue (emscripten_main_thread_process_queued_calls)
>>    6. Queue's mutex lock to read and empty the queue
>>    7. Queue's mutex unlock
>>
>> I see one area that could be improved,
>> emscripten_main_thread_process_queued_calls: Instead of keeping the mutex
>> locked while we process all queue calls, we could just fetch them on a
>> stack variable and unlock the mutex immediately, allowing other threads to
>> add to the queue while they are being processed, but I'm not sure this
>> would improve the situation here since anyway we are waiting on the futex
>> to complete the fread operation. I'll try to profile the actual locking
>> mechanic to see if its the mutex, or the context switching using the
>> message that is slow.
>>
>> On Thursday, November 5, 2015 at 10:08:08 AM UTC-5, Floh wrote:
>>>
>>> Sounds like you need some sort of read-ahead, where a small read would
>>> actually read a larger portion of the file into a (size-configurable)
>>> buffer on the worker-thread side with a single round-trip to the
>>> main-thread, and subsequent small reads wouldn't require a main-thread
>>> round-trip until the end of the buffer is reached, and the next read would
>>> fill up the buffer again...
>>>
>>> Disclaimer: I haven't looked at the code, even if every small read does
>>> a complete, blocking thread-roundtrip, this doesn't explain the high
>>> blocking times you are seeing. I guess there is another problem that a
>>> single read has a too high blocking time (may be it can only do at most 1
>>> read per frame, or something similar).
>>>
>>> Cheers,
>>> -Floh.
>>>
>>> Am Donnerstag, 5. November 2015 15:09:04 UTC+1 schrieb Robert Goulet:
>>>>
>>>> Where is the code that deals with this locking mechanics? I'd like to
>>>> take a look at it.
>>>>
>>>> Also, it doesn't matter much for us how long the entire process takes.
>>>> It's more each read time that matters. If we can improve that, would be
>>>> great!
>>>>
>>>> On Wednesday, November 4, 2015 at 5:22:32 PM UTC-5, Alon Zakai wrote:
>>>>>
>>>>> Thanks for the testcase. I see the same results.
>>>>>
>>>>> It looks like reducing the number of reads helps a lot. The overhead
>>>>> is affected mostly by number of reads. Which could make sense if the main
>>>>> thread is busy (since it's the main browser thread, it could be busy doing
>>>>> anything from rendering to doing some work for another tab) and the worker
>>>>> needs to wait on it. Also, we send a message to the main thread, so any
>>>>> general activity on the event queue could lead to the message being
>>>>> received later.
>>>>>
>>>>> It's also possible that the mutex and futex stuff we do for the
>>>>> blocking call has overhead. Jukka, do we have a way to profile that?
>>>>>
>>>>>
>>>>> On Wed, Nov 4, 2015 at 2:06 PM, Robert Goulet <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Here, I quickly wrote this small test case
>>>>>> <https://autodesk.box.com/s/h12u5woqkuwyzg4uk7hw6puck8jg8orh> which
>>>>>> reproduce the problem.
>>>>>>
>>>>>> I get the following output from it:
>>>>>>
>>>>>> Preallocating 1 workers for a pthread spawn pool.
>>>>>> Writing test file (1048576 bytes)...
>>>>>> Reading file from main thread...
>>>>>> Completed in 6.730000ms (~0.205156ms per read of 32768 bytes)
>>>>>> Reading file from another thread...
>>>>>> Completed in 3499.585000ms (~109.355312ms per read of 32768 bytes)
>>>>>> Done.
>>>>>>
>>>>>> Please let me know if there's anything we can do to fix this major
>>>>>> difference between the two.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Wednesday, November 4, 2015 at 3:25:58 PM UTC-5, Robert Goulet
>>>>>> wrote:
>>>>>>>
>>>>>>> That is for many small/medium reads. It seems the size of the read
>>>>>>> does not have any impact on performance thought.
>>>>>>>
>>>>>>> Essentially, the thread we create does the following:
>>>>>>>
>>>>>>> while (true) {
>>>>>>> _queue_semaphore.wait();
>>>>>>> if (_exit_thread)
>>>>>>> break;
>>>>>>> process_request();
>>>>>>> }
>>>>>>>
>>>>>>> in that case, the process_request() function takes 200ms+ to execute
>>>>>>> (profiled within the process_request() function itself, so that it does 
>>>>>>> not
>>>>>>> include locking mechanics overhead).
>>>>>>>
>>>>>>> If we just call process_request() right away upon adding request
>>>>>>> instead of inserting the request in the thread queue (which essentially
>>>>>>> just bypass the thread completely), the process_request() function takes
>>>>>>> <0.2ms to execute. The only thing this function does is a switch case
>>>>>>> between fopen(), fread() and fclose(). I've narrowed it down to being 
>>>>>>> these
>>>>>>> filesystem function who are taking a much longer time to return.
>>>>>>>
>>>>>>> The main thread is waiting on the thread queue to complete before
>>>>>>> returning, so I don't see why it would block the thread from doing its 
>>>>>>> work?
>>>>>>>
>>>>>>> On Wednesday, November 4, 2015 at 12:53:21 PM UTC-5, Alon Zakai
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> The filesystem itself resides in JS, which can only be accessed
>>>>>>>> from the main thread. Workers therefore need to send messages to
>>>>>>>> communicate with it. However, 200ms seems ridiculously high - is that 
>>>>>>>> for a
>>>>>>>> single read()? Or many small reads of small amounts? If you can make a
>>>>>>>> small standalone testcase showing the issue, that would be useful for
>>>>>>>> benchmarking.
>>>>>>>>
>>>>>>>> A possibility is that the blocking is the issue, and the main
>>>>>>>> thread is busy with something else.
>>>>>>>>
>>>>>>>> On Wed, Nov 4, 2015 at 7:57 AM, Robert Goulet <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> we are using the new pthread support in Emscripten, and one thing
>>>>>>>>> we noticed is how much slower filesystem functions are when executed 
>>>>>>>>> in a
>>>>>>>>> thread. We saw this in the documentation:
>>>>>>>>>
>>>>>>>>> *Currently several of the functions in the C runtime, such as
>>>>>>>>>> filesystem functions like fopen(), fread(), printf(), fprintf() etc. 
>>>>>>>>>> are
>>>>>>>>>> not multithreaded, but instead their execution is proxied over to 
>>>>>>>>>> the main
>>>>>>>>>> application thread.*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm just trying to understand what we are dealing with. At this
>>>>>>>>> point I am guessing this is the reason why it is much slower to read 
>>>>>>>>> a file
>>>>>>>>> using fread in a thread. We are seeing 1000x slowdowns compared to 
>>>>>>>>> running
>>>>>>>>> in the main thread directly. For example, running in main thread, a 
>>>>>>>>> read
>>>>>>>>> request can complete in 0.2ms, while in a thread is takes 200ms. Most
>>>>>>>>> likely that's the overhead of waiting on the main thread to process 
>>>>>>>>> proxied
>>>>>>>>> requests?
>>>>>>>>>
>>>>>>>>> Is there any technical blockers preventing filesystem functions to
>>>>>>>>> be multithreaded so that they are no longer put in the main thread 
>>>>>>>>> proxy
>>>>>>>>> queue?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "emscripten-discuss" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to [email protected].
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "emscripten-discuss" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>> --
> You received this message because you are subscribed to the Google Groups
> "emscripten-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"emscripten-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to