Seems like we are dealing with many things here:

   1. Queue's mutex lock to add the call to the queue 
   (emscripten_sync_run_in_main_thread)
   2. Then we call postMessage (not sure how it works, could there be some 
   blocking here?) to notify we have something in queue
   3. Then queue's mutex unlock
   4. Then we block wait on a futex for the operation to complete
   5. At some point, the main thread wakes up and process the calls in the 
   queue (emscripten_main_thread_process_queued_calls)
   6. Queue's mutex lock to read and empty the queue
   7. Queue's mutex unlock

I see one area that could be improved, 
emscripten_main_thread_process_queued_calls: Instead of keeping the mutex 
locked while we process all queue calls, we could just fetch them on a 
stack variable and unlock the mutex immediately, allowing other threads to 
add to the queue while they are being processed, but I'm not sure this 
would improve the situation here since anyway we are waiting on the futex 
to complete the fread operation. I'll try to profile the actual locking 
mechanic to see if its the mutex, or the context switching using the 
message that is slow.

On Thursday, November 5, 2015 at 10:08:08 AM UTC-5, Floh wrote:
>
> Sounds like you need some sort of read-ahead, where a small read would 
> actually read a larger portion of the file into a (size-configurable) 
> buffer on the worker-thread side with a single round-trip to the 
> main-thread, and subsequent small reads wouldn't require a main-thread 
> round-trip until the end of the buffer is reached, and the next read would 
> fill up the buffer again...
>
> Disclaimer: I haven't looked at the code, even if every small read does a 
> complete, blocking thread-roundtrip, this doesn't explain the high blocking 
> times you are seeing. I guess there is another problem that a single read 
> has a too high blocking time (may be it can only do at most 1 read per 
> frame, or something similar).
>
> Cheers,
> -Floh.
>
> Am Donnerstag, 5. November 2015 15:09:04 UTC+1 schrieb Robert Goulet:
>>
>> Where is the code that deals with this locking mechanics? I'd like to 
>> take a look at it.
>>
>> Also, it doesn't matter much for us how long the entire process takes. 
>> It's more each read time that matters. If we can improve that, would be 
>> great!
>>
>> On Wednesday, November 4, 2015 at 5:22:32 PM UTC-5, Alon Zakai wrote:
>>>
>>> Thanks for the testcase. I see the same results.
>>>
>>> It looks like reducing the number of reads helps a lot. The overhead is 
>>> affected mostly by number of reads. Which could make sense if the main 
>>> thread is busy (since it's the main browser thread, it could be busy doing 
>>> anything from rendering to doing some work for another tab) and the worker 
>>> needs to wait on it. Also, we send a message to the main thread, so any 
>>> general activity on the event queue could lead to the message being 
>>> received later.
>>>
>>> It's also possible that the mutex and futex stuff we do for the blocking 
>>> call has overhead. Jukka, do we have a way to profile that?
>>>
>>>
>>> On Wed, Nov 4, 2015 at 2:06 PM, Robert Goulet <[email protected]> 
>>> wrote:
>>>
>>>> Here, I quickly wrote this small test case 
>>>> <https://autodesk.box.com/s/h12u5woqkuwyzg4uk7hw6puck8jg8orh> which 
>>>> reproduce the problem.
>>>>
>>>> I get the following output from it:
>>>>
>>>> Preallocating 1 workers for a pthread spawn pool.
>>>> Writing test file (1048576 bytes)...
>>>> Reading file from main thread...
>>>> Completed in 6.730000ms (~0.205156ms per read of 32768 bytes)
>>>> Reading file from another thread...
>>>> Completed in 3499.585000ms (~109.355312ms per read of 32768 bytes)
>>>> Done.
>>>>
>>>> Please let me know if there's anything we can do to fix this major 
>>>> difference between the two.
>>>>
>>>> Thanks!
>>>>
>>>> On Wednesday, November 4, 2015 at 3:25:58 PM UTC-5, Robert Goulet wrote:
>>>>>
>>>>> That is for many small/medium reads. It seems the size of the read 
>>>>> does not have any impact on performance thought.
>>>>>
>>>>> Essentially, the thread we create does the following:
>>>>>
>>>>> while (true) {
>>>>> _queue_semaphore.wait();
>>>>> if (_exit_thread)
>>>>> break;
>>>>> process_request();
>>>>> }
>>>>>
>>>>> in that case, the process_request() function takes 200ms+ to execute 
>>>>> (profiled within the process_request() function itself, so that it does 
>>>>> not 
>>>>> include locking mechanics overhead).
>>>>>
>>>>> If we just call process_request() right away upon adding request 
>>>>> instead of inserting the request in the thread queue (which essentially 
>>>>> just bypass the thread completely), the process_request() function takes 
>>>>> <0.2ms to execute. The only thing this function does is a switch case 
>>>>> between fopen(), fread() and fclose(). I've narrowed it down to being 
>>>>> these 
>>>>> filesystem function who are taking a much longer time to return.
>>>>>
>>>>> The main thread is waiting on the thread queue to complete before 
>>>>> returning, so I don't see why it would block the thread from doing its 
>>>>> work?
>>>>>
>>>>> On Wednesday, November 4, 2015 at 12:53:21 PM UTC-5, Alon Zakai wrote:
>>>>>>
>>>>>> The filesystem itself resides in JS, which can only be accessed from 
>>>>>> the main thread. Workers therefore need to send messages to communicate 
>>>>>> with it. However, 200ms seems ridiculously high - is that for a single 
>>>>>> read()? Or many small reads of small amounts? If you can make a small 
>>>>>> standalone testcase showing the issue, that would be useful for 
>>>>>> benchmarking.
>>>>>>
>>>>>> A possibility is that the blocking is the issue, and the main thread 
>>>>>> is busy with something else.
>>>>>>
>>>>>> On Wed, Nov 4, 2015 at 7:57 AM, Robert Goulet <[email protected]
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> we are using the new pthread support in Emscripten, and one thing we 
>>>>>>> noticed is how much slower filesystem functions are when executed in a 
>>>>>>> thread. We saw this in the documentation:
>>>>>>>
>>>>>>> *Currently several of the functions in the C runtime, such as 
>>>>>>>> filesystem functions like fopen(), fread(), printf(), fprintf() etc. 
>>>>>>>> are 
>>>>>>>> not multithreaded, but instead their execution is proxied over to the 
>>>>>>>> main 
>>>>>>>> application thread.*
>>>>>>>
>>>>>>>
>>>>>>> I'm just trying to understand what we are dealing with. At this 
>>>>>>> point I am guessing this is the reason why it is much slower to read a 
>>>>>>> file 
>>>>>>> using fread in a thread. We are seeing 1000x slowdowns compared to 
>>>>>>> running 
>>>>>>> in the main thread directly. For example, running in main thread, a 
>>>>>>> read 
>>>>>>> request can complete in 0.2ms, while in a thread is takes 200ms. Most 
>>>>>>> likely that's the overhead of waiting on the main thread to process 
>>>>>>> proxied 
>>>>>>> requests?
>>>>>>>
>>>>>>> Is there any technical blockers preventing filesystem functions to 
>>>>>>> be multithreaded so that they are no longer put in the main thread 
>>>>>>> proxy 
>>>>>>> queue?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "emscripten-discuss" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected].
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "emscripten-discuss" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"emscripten-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to