Re: [julia-users] Re: What is the fastest way to perform 100k blocking IO operations in parallel?

Andrei Zh Tue, 25 Aug 2015 14:07:47 -0700

I submitted a new issue <https://github.com/JuliaWeb/Requests.jl/issues/61> 
to Requests.jl describing the problem and my observations. In short, each 
separate request takes a reasonable time, but when I launch a lot of them 
in tasks, they become very slow.
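
The pattern in question, as a minimal sketch rather than the code from the issue: one task per request via @sync/@async, with each request timed individually so that any growth in per-request latency becomes visible. `fake_request` is a hypothetical stand-in that just sleeps (so the snippet runs without network access or Requests.jl), the URLs are made up, and the syntax assumes a recent Julia (1.x) rather than the version used in this thread.

```julia
# Hypothetical stand-in for a blocking HTTP request. `sleep` yields to other
# tasks while waiting, much like real socket I/O does.
fake_request(url) = (sleep(0.1); length(url))

urls = ["http://example.com/$i" for i in 1:100]   # made-up URLs

latencies = zeros(length(urls))       # per-request wall time, in seconds
results   = zeros(Int, length(urls))  # per-request return value

# One task per URL; @sync waits until all of them have finished.
total = @elapsed @sync for (i, url) in enumerate(urls)
    @async begin
        latencies[i] = @elapsed begin
            results[i] = fake_request(url)
        end
    end
end

println("total: ", round(total, digits=2), " s, ",
        "slowest request: ", round(maximum(latencies), digits=2), " s")
```

With a real request function in place of `fake_request`, comparing `maximum(latencies)` against the time of a single standalone request shows whether latency grows with the number of concurrent tasks.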



On Tuesday, August 25, 2015 at 2:44:30 AM UTC+3, Andrei Zh wrote:
>
> @Jameson: setting UV_THREADPOOL_SIZE to 32 seems to cut DNS resolution 
> time in half (from 26 to 12 seconds in my latest tests), so thank you. 
>
> However, DNS does not seem to be the only cause of the problem: I noticed 
> that with a large number of URLs the first ones get a result very quickly 
> (300-800 ms), but then latency begins to grow until the timeout is exceeded 
> (I've seen more than 100 seconds for some requests). 
>
> I will try to set up a stable test and post it here as well as in the 
> Requests.jl issues. 
>
>
> On Monday, August 24, 2015 at 8:21:51 PM UTC+3, Jameson wrote:
>>
>> If you are doing a lot of parallel DNS queries, you may want to try 
>> increasing the number that can run simultaneously by setting the 
>> UV_THREADPOOL_SIZE environment variable to something larger before 
>> starting julia (default is 4, max is 128).
>>
>> On Mon, Aug 24, 2015 at 9:17 AM Andrei Zh <[email protected]> wrote:
>>
>>> Jonathan, thanks for your support. So far I noticed that DNS causes a 
>>> pretty large delay. E.g. resolving IP addresses for 1000 URLs took 80 
>>> seconds in serial code and 26 seconds in multi-task code: 
>>>
>>>
>>> Serial execution: 
>>>
>>> julia> @time for url in urls
>>>                begin
>>>                    Base.getaddrinfo(URI(url).host)
>>>                end
>>>            end
>>> elapsed time: 80.071810293 seconds (732400 bytes allocated)
>>>
>>>
>>> Multitask execution:
>>>
>>>
>>> julia> @time @sync for url in urls
>>>            @async begin
>>>                Base.getaddrinfo(URI(url).host)
>>>            end
>>>        end
>>>
>>> elapsed time: 26.241893516 seconds (4277968 bytes allocated)
>>>
>>> So I'll try to pre-resolve IPs and test again. 
>>>
>>>
>>> On Monday, August 24, 2015 at 4:01:44 PM UTC+3, Jonathan Malmaud wrote:
>>>
>>>> As one of the maintainers of Requests.jl, I'm especially interested in 
>>>> its use for high-performance applications so don't hesitate to file an 
>>>> issue if it gives you any performance problems.
>>>>
>>>> On Sunday, August 23, 2015 at 7:40:08 PM UTC-4, Andrei Zh wrote:
>>>>>
>>>>> Hi Steven, 
>>>>>
>>>>> thanks for your answer! It turns out I misunderstood @async a long time 
>>>>> ago, assuming it also makes a remote call to other processes and thus 
>>>>> introduces true multi-tasking. So now I need to rethink my approach 
>>>>> before going further. 
>>>>>
>>>>> Just to clarify: my goal is to perform as many requests as possible at 
>>>>> the same time, so I want to use both multiple processes (to start 
>>>>> several requests on several cores in parallel) and tasks (to launch new 
>>>>> requests while old ones are still waiting for IO to complete). 
>>>>>
>>>>> So I will update my approach and come back with results or new 
>>>>> questions. 
>>>>>
>>>>>
>>>>>
>>>>> On Monday, August 24, 2015 at 2:13:23 AM UTC+3, Steven G. Johnson wrote:
>>>>>>
>>>>>> @parallel in Julia is for executing separate parallel processes (true 
>>>>>> multi-tasking, with separate address spaces).  @async is for "tasks", 
>>>>>> which are "green threads" and represent cooperative multitasking (within 
>>>>>> the same process and the same address space).
>>>>>>
>>>>>> I/O in Julia is asynchronous — while one task is blocked waiting on 
>>>>>> I/O, another task will wake up and start running.  (This is based on the 
>>>>>> libuv library, which is designed for high-performance asynchronous I/O.)
>>>>>>
>>>>>> The first question is whether you want to fetch URLs in separate OS 
>>>>>> processes, or you want to use green threads within the same process.  It 
>>>>>> sounds like you want the latter, in which case @async is the right thing.
>>>>>>
>>>>>> The second question is whether something about the Requests.jl 
>>>>>> package is serializing things somehow; for that you might file an issue 
>>>>>> at Requests.jl.
>>>>>>
>>>>>
