Committed here:

https://github.com/python/asyncio/commit/39c135baf73762830148236da622787052efba19

Yury, would you please update the CPython repo when the time is right?

On Wednesday, December 9, 2015 at 6:05:46 PM UTC-5, A. Jesse Jiryu Davis 
wrote:
>
> Done:
>
> https://github.com/python/asyncio/pull/302
>
> On Wednesday, December 9, 2015 at 2:58:07 PM UTC-5, Guido van Rossum wrote:
>>
>> It'll go much quicker if you send a PR to the asyncio github project. 
>> Thanks!
>>
>> --Guido (mobile)
>> On Dec 9, 2015 11:56 AM, "A. Jesse Jiryu Davis" <[email protected]> 
>> wrote:
>>
>>> Thanks for your kind words about the crawler chapter. =) I think we 
>>> didn't hit this problem there because the crawler only looked up one 
>>> domain, or a few of them. On Mac, subsequent calls are cached at the OS 
>>> layer for a few minutes, so getaddrinfo is very fast, even though it's 
>>> serialized by the getaddrinfo lock. Besides, I don't think the crawler has a 
>>> connection timeout, so if it *were* looking up hundreds of domains it 
>>> would wait as long as necessary for the lock.
>>>
>>> In this bug report, on the other hand, there are hundreds of coroutines 
>>> all waiting to look up different domains, and some of those domains take 
>>> several seconds to resolve. Resolving hundreds of different domains, plus 
>>> the getaddrinfo lock, plus Motor's 20-second connection timeout, all 
>>> combine to create this bad behavior.
>>>
>>> I like option #4 also; that gives me the freedom to either cache lookups 
>>> in Motor, or to treat "localhost" specially, or at least to prevent a slow 
>>> getaddrinfo from appearing to be a connection timeout. I haven't decided 
>>> which is the best approach, but #4 allows me flexibility. Would you like me 
>>> to write the patch or will you?
>>>
>>> I'm curious about #5 too, but I think that's a longer term project.
>>>
>>> On Wednesday, December 9, 2015 at 12:57:53 PM UTC-5, Guido van Rossum 
>>> wrote:
>>>>
>>>> 4. Modify asyncio's getaddrinfo() so that if you pass it something that 
>>>> looks like a numerical address (IPv4 or IPv6 syntax, looking exactly like 
>>>> what getaddrinfo() returns) it skips calling socket.getaddrinfo(). I've 
>>>> wanted this for a while but hadn't run into this queue issue, so this is 
>>>> my favorite.
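[Editor's note: a minimal sketch of the literal-address check option #4 implies — a hypothetical helper, not the patch that eventually landed in asyncio — could use socket.inet_pton to recognize IPv4/IPv6 literals:]

```python
import socket

def is_numeric_address(host):
    """Return True if host is already an IPv4 or IPv6 literal,
    so the socket.getaddrinfo() call (and its lock) can be skipped."""
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            socket.inet_pton(family, host)
            return True
        except OSError:
            # Not a valid literal for this address family.
            pass
    return False
```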
>>>>
>>>> 5. Do the research needed to prove that socket.getaddrinfo() on OS X 
>>>> (or perhaps on sufficiently recent versions of OS X) is thread-safe and 
>>>> submit a patch that avoids the getaddrinfo lock on those versions of OS X.
>>>>
>>>> FWIW, IIRC my crawler example (which your editing made so much better!) 
>>>> calls getaddrinfo() for every connection and I hadn't experienced any 
>>>> slowness there. But maybe in the Motor example you're hitting a slow DNS 
>>>> server? (I'm guessing there are lots of system configuration parameters 
>>>> that may make the system's getaddrinfo() slower or faster, and in your 
>>>> setup it may well be slower.)
>>>>
>>>> On Wed, Dec 9, 2015 at 7:05 AM, A. Jesse Jiryu Davis <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks Guido, this all makes sense.
>>>>>
>>>>> One problem, though, is that even if I call getaddrinfo myself in 
>>>>> Motor and wrap a cache around it, I *still* can't use the event 
>>>>> loop's create_connection() call. create_connection always 
>>>>> calls getaddrinfo, and even though it "should be quick", that doesn't 
>>>>> matter in this scenario: the time is spent waiting for the getaddrinfo 
>>>>> lock. So I would need one of:
>>>>>
>>>>> 1. Copy the body of create_connection into Motor so I can customize 
>>>>> the getaddrinfo call. I especially don't like this because I'd have to 
>>>>> call 
>>>>> the loop's private _create_connection_transport() from Motor's 
>>>>> customized create_connection(). Perhaps we could 
>>>>> make _create_connection_transport public, or otherwise make a public API 
>>>>> for separating getaddrinfo from actually establishing the connection?
>>>>>
>>>>> 2. Make getaddrinfo customizable in asyncio (
>>>>> https://github.com/python/asyncio/issues/160). This isn't ideal, 
>>>>> since it requires Motor users on Mac / BSD to change configuration for 
>>>>> the 
>>>>> whole event loop just so Motor's specific create_connection calls behave 
>>>>> correctly.
>>>>>
>>>>> 3. Back to the original proposal: add a connection timeout parameter 
>>>>> to create_connection. =)
>>>>>
>>>>> On Tuesday, December 8, 2015 at 4:30:04 PM UTC-5, Guido van Rossum 
>>>>> wrote:
>>>>>
>>>>>> On Tue, Dec 8, 2015 at 7:13 AM, A. Jesse Jiryu Davis <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi, a Motor user began an interesting discussion on the MongoDB-user 
>>>>>>> list:
>>>>>>>
>>>>>>> https://groups.google.com/d/topic/mongodb-user/2oK6C3BrVKI/discussion
>>>>>>>
>>>>>>> The summary is this: he's fetching hundreds of URLs concurrently and 
>>>>>>> inserting the results into MongoDB with Motor. Motor throws lots of 
>>>>>>> connection-timeout errors. The problem is getaddrinfo: on Mac, Python 
>>>>>>> only 
>>>>>>> allows one getaddrinfo call at a time. With hundreds of HTTP fetches in 
>>>>>>> progress, there's a long queue waiting for the getaddrinfo lock. 
>>>>>>> Whenever 
>>>>>>> Motor wants to grow its connection pool it has to call getaddrinfo on 
>>>>>>> "localhost", and it spends so long waiting for that call, it times out 
>>>>>>> and 
>>>>>>> thinks it can't reach MongoDB.
>>>>>>>
>>>>>>
>>>>>> If it's really looking up "localhost" over and over, maybe wrap a 
>>>>>> cache around getaddrinfo()?
>>>>>>  
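[Editor's note: a minimal sketch of such a cache — a hypothetical wrapper, not Motor's actual code. All of getaddrinfo's arguments are hashable, so functools.lru_cache can key on them directly:]

```python
import functools
import socket

@functools.lru_cache(maxsize=128)
def cached_getaddrinfo(host, port, family=0, type=0, proto=0, flags=0):
    # A production cache would also expire entries, since DNS answers
    # carry TTLs, and callers must not mutate the returned (shared) list.
    return socket.getaddrinfo(host, port, family, type, proto, flags)
```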
>>>>>>
>>>>>>> Motor's connection-timeout implementation in asyncio is sort of 
>>>>>>> wrong:
>>>>>>>
>>>>>>>     coro = asyncio.open_connection(host, port)
>>>>>>>     reader, writer = yield from asyncio.wait_for(coro, timeout)
>>>>>>>
>>>>>>> The timer runs during the call to getaddrinfo, as well as the call 
>>>>>>> to the loop's sock_connect(). This isn't the intention: the timeout 
>>>>>>> should 
>>>>>>> apply only to the connection.
>>>>>>>
>>>>>>> A philosophical digression: The "connection timeout" is a heuristic. 
>>>>>>> "If I've waited N seconds and haven't established the connection, I 
>>>>>>> probably never will. Give up." Based on what they know about their own 
>>>>>>> networks, users can tweak the connection timeout. In a fast network, a 
>>>>>>> server that hasn't responded in 20ms is probably down; but on a global 
>>>>>>> network, 10 seconds might be reasonable. Regardless, the heuristic only 
>>>>>>> applies to the actual TCP connection. Waiting for getaddrinfo is not 
>>>>>>> related; that's up to the operating system.
>>>>>>>
>>>>>>> In a multithreaded client like PyMongo we distinguish the two phases:
>>>>>>>
>>>>>>>     err = None
>>>>>>>     for res in socket.getaddrinfo(host, port, family, 
>>>>>>> socket.SOCK_STREAM):
>>>>>>>         af, socktype, proto, dummy, sa = res
>>>>>>>         sock = socket.socket(af, socktype, proto)
>>>>>>>         try:
>>>>>>>             sock.settimeout(connect_timeout)
>>>>>>>             
>>>>>>>             # THE TIMEOUT ONLY APPLIES HERE.
>>>>>>>             sock.connect(sa)
>>>>>>>             sock.settimeout(None)
>>>>>>>             return sock
>>>>>>>         except socket.error as e:
>>>>>>>             # Connection refused, or not established within the
>>>>>>>             # timeout; close and try the next address.
>>>>>>>             err = e
>>>>>>>             sock.close()
>>>>>>>     raise err
>>>>>>>
>>>>>>> Here, the call to getaddrinfo isn't timed at all, and each distinct 
>>>>>>> attempt to connect on a different address is timed separately. So this 
>>>>>>> kind 
>>>>>>> of code matches the idea of a "connect timeout" as a heuristic for 
>>>>>>> deciding 
>>>>>>> whether the server is down.
>>>>>>>
>>>>>>> Two questions:
>>>>>>>
>>>>>>> 1. Should asyncio.open_connection support a connection timeout that 
>>>>>>> acts like the blocking version above? That is, a connection timeout 
>>>>>>> that 
>>>>>>> does not include getaddrinfo, and restarts for each address we attempt 
>>>>>>> to 
>>>>>>> connect to?
>>>>>>>
>>>>>>
>>>>>> Hm, I don't really like adding timeouts to every API. As you describe 
>>>>>> everyone has different needs. IMO if you don't want the timeout to cover 
>>>>>> the getaddrinfo() call, call getaddrinfo() yourself and pass the host 
>>>>>> address into the create_connection() call. That way you also have 
>>>>>> control 
>>>>>> over whether to e.g. implement "happy eyeballs". (It will still call 
>>>>>> socket.getaddrinfo(), but it should be quick -- it's not going to a DNS 
>>>>>> server or even /etc/hosts to discover that 127.0.0.1 maps to 127.0.0.1.)
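[Editor's note: a sketch of the approach Guido describes — resolve first, untimed, then time each connect attempt separately. Hypothetical helper; asyncio.run and get_running_loop postdate this 2015 thread:]

```python
import asyncio
import socket

async def connect_pre_resolved(host, port, timeout):
    """Resolve first (not covered by the timeout), then apply the
    timeout to each connection attempt individually."""
    loop = asyncio.get_running_loop()
    infos = await loop.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    last_exc = None
    for family, socktype, proto, _, addr in infos:
        try:
            # Pass the numeric address so create_connection's internal
            # getaddrinfo call returns quickly; the timeout covers only
            # this single connect attempt.
            return await asyncio.wait_for(
                asyncio.open_connection(addr[0], addr[1]), timeout)
        except (OSError, asyncio.TimeoutError) as exc:
            last_exc = exc
    if last_exc is None:
        raise OSError("getaddrinfo returned no addresses")
    raise last_exc
```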
>>>>>>  
>>>>>>
>>>>>>>
>>>>>>> 2. Why does Python lock around getaddrinfo on Mac and Windows 
>>>>>>> anyway? The code comment says these are "systems on which getaddrinfo() 
>>>>>>> is 
>>>>>>> believed to not be thread-safe". Has this belief ever been confirmed?
>>>>>>>
>>>>>>>
>>>>>>> https://hg.python.org/cpython/file/d2b8354e87f5/Modules/socketmodule.c#l185
>>>>>>>
>>>>>>
>>>>>> I don't know -- the list of ifdefs seems to indicate this is a 
>>>>>> generic BSD issue, which is OS X's heritage. Maybe someone can do an 
>>>>>> experiment, or review the source code used by Apple (if it's still open 
>>>>>> source)? While I agree that if this really isn't an issue we shouldn't 
>>>>>> bother with the lock, I'd also much rather be safe than sorry when it 
>>>>>> comes 
>>>>>> to races in core Python.
>>>>>>
>>>>>> -- 
>>>>>> --Guido van Rossum (python.org/~guido)
>>>>>>
>>>>>
>>>>
>>>>
>>>> -- 
>>>> --Guido van Rossum (python.org/~guido)
>>>>
>>>