[GitHub] [airflow] potiuk commented on pull request #24981: Patch getfqdn with more resilient version

GitBox Mon, 11 Jul 2022 17:28:10 -0700


potiuk commented on PR #24981:
URL: https://github.com/apache/airflow/pull/24981#issuecomment-1181168840


   > i am curious.... if the issue is just a changing representation, would it 
suffice to _only_ cache the result? or do we also need the other logic? or if 
the other logic is good, do we still need to cache?
   
   This is a defensive/protective approach to guard against both a wrong FQDN 
and one that changes. I believe both changes should help with cases like 
https://github.com/apache/airflow/discussions/20269#discussioncomment-3095616
   
   But I have no hard proof, unfortunately. This is quite a leap of faith: 
an attempt to get a more stable solution without a way to easily test that it 
actually solves the problem. No one else seemed interested in discussing 
it when I raised the question several times in our development Slack, so I 
decided to do some more searching and make a PR.
   
   > like if we had issues with the value changing... what was the problem? 
does the "old" value become non-functional or something? or do they both tell 
the truth?
   
   Context:
   
   I think (but I am not 100% sure) they MIGHT or MIGHT NOT work, depending on 
the state of networking and DNS stability. From what I read in the description of 
https://github.com/python/cpython/issues/49254, socket.getfqdn() on hosts that 
have a DNS-only resolver (which is the case for K8S, I believe) might sometimes 
get a wrong, non-canonical name. It depends on many factors:
   
   * whether the host supports IPv4 or IPv6
   * how many network interfaces there are
   * whether the host is registered under some of the addresses with a name that 
contains "." (so it does not have to be a "fully qualified" name; it can be a 
shorter name, but it should contain ".")
   * the order in which those addresses are returned
   * whether the internal DNS of the K8S cluster is not too busy and can respond 
quickly enough for each of those addresses
   
   `getfqdn` takes a shortcut: it returns the first name derived 
from the IP addresses via gethostbyaddr() that contains "." in the name. This is 
the gist of the issue: that name is not **always** the canonical name. Mostly 
yes. But not always. And I think the host would normally be reachable by both 
names, but we use the name in a few places as "verification" and a 
"consistency check" (and raise exceptions if the expected hostname does not match 
the observed one).
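
   To make the shortcut concrete, here is a minimal sketch of the selection 
step only. It is a simplified paraphrase, not the stdlib source: the function 
name `pick_name` and the example hostnames are hypothetical, and the inputs 
stand in for the primary name and aliases that gethostbyaddr() would return.

```python
def pick_name(hostname, aliases):
    """Return the first candidate containing a dot, else the primary name.

    Simplified illustration of the "first dotted name wins" shortcut.
    `hostname` and `aliases` stand in for what gethostbyaddr() reports.
    """
    candidates = [hostname, *aliases]
    for name in candidates:
        if "." in name:
            # The first dotted name is taken, whether or not it is canonical.
            return name
    # No dotted name at all: fall back to the primary name.
    return hostname
```

   With these made-up inputs, `pick_name("10-1-2-3.pod", 
["worker-0.airflow.svc.cluster.local"])` yields "10-1-2-3.pod": the result 
depends on the order of the candidates, not on which one is canonical.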
   
   Kubernetes networking is complex and the result might change depending on 
the DNS responses/registration of the Pods/Containers on the various network 
interfaces (usually each Pod/Container has multiple network interfaces). 
It also depends on your K8S configuration, including 
some security rules, istio, VPN, security zones, ingress, what network 
virtualization is used, etc. Network virtualization in K8S is the 
centerpiece of how it works and it is a pretty complex subject. Just look here 
https://kubernetes.io/docs/concepts/cluster-administration/networking/ to see 
how many network virtualization options are possible. The list goes on and on.
   
   My take:
   
   The getfqdn change (which is borrowed from 
https://github.com/borgbackup/borg/issues/3471) changes the retrieval mechanism 
to use getaddrinfo with a hint that only the canonical name should be returned. 
HOPEFULLY this will work better. But I do not know. This is based on evidence 
from some borgbackup people and my understanding of how it works.
   
   From: https://man7.org/linux/man-pages/man3/getaddrinfo.3.html
   
   >  If hints.ai_flags includes the AI_CANONNAME flag, then the ai_canonname 
field of the first of the addrinfo structures in the returned list is set to 
point to the official name of the host.
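
   A rough sketch of what a getaddrinfo-based lookup with the AI_CANONNAME 
hint can look like in Python. This is an illustration under assumptions, not 
necessarily the exact code in the PR: the function name `canonical_fqdn` and 
the fallback behaviour (return the input name when resolution fails) are my 
choices for the example.

```python
import socket


def canonical_fqdn(name=""):
    """Resolve `name` (or the local hostname) asking for the canonical name.

    Illustrative sketch: falls back to the unresolved name on any failure.
    """
    name = name.strip()
    if not name or name == "0.0.0.0":
        name = socket.gethostname()
    try:
        # AI_CANONNAME asks the resolver to fill in the canonical name.
        infos = socket.getaddrinfo(name, None, flags=socket.AI_CANONNAME)
    except OSError:
        # Covers socket.gaierror too; keep the name we already have.
        return name
    for _family, _type, _proto, canonname, _sockaddr in infos:
        if canonname:
            # Only the first entry is expected to carry the canonical name.
            return canonname
    return name
```

   Whether this is more stable than `socket.getfqdn()` on a given cluster is 
exactly the open question; the sketch only shows where the hint goes.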
   
   My hope is that (similarly to the borgbackup case) it will give more stable 
results. But I am not sure, and I am not able to test that it will always be 
100% accurate. The issue has been open for 13 years, and I believe the main 
reason is the "volatility" of the behaviour. So I am not sure we will ever be 
able to have "hard" data on it. We need to act on intuition and "educated 
guesses" here, I am afraid. And likely we will never know if it worked, because it 
happens intermittently and is not reproducible. I've seen quite a few 
questions raised in Slack/discussions (usually there were no hard 
reproduction steps, mostly anecdotal evidence, but I saw it 
frequently enough to believe it is happening).
   
   So I also implemented caching. This should help in that one 
interpreter should only retrieve the name once. I think some of the issues might 
come from the fact that the same interpreter retrieves a different FQDN at 
different times. Caching should help there. And it should at least help to keep 
consistency even if we have multiple interpreters but the workers are spawned 
with "forks", because if the cache is populated before forking, it will stay 
there.
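
   A small sketch of the caching idea, assuming `functools.lru_cache` as the 
mechanism (the PR may do it differently). The resolver here is a deliberately 
flaky stand-in that alternates between two made-up names, so the effect of the 
cache is visible without depending on real DNS; `flaky_resolver` and 
`cached_hostname` are hypothetical names.

```python
import functools
import itertools

# Hypothetical stand-in for a lookup whose answer changes between calls.
_answers = itertools.cycle(
    ["worker-0.airflow.svc.cluster.local", "10-1-2-3.airflow.pod.cluster.local"]
)


def flaky_resolver():
    """Return a different made-up name on each call, like an unstable lookup."""
    return next(_answers)


@functools.lru_cache(maxsize=None)
def cached_hostname():
    """Resolve once per interpreter; later calls reuse the first answer."""
    return flaky_resolver()
```

   Every call to `cached_hostname()` in the same interpreter returns whatever 
the first call returned, even though the underlying resolver keeps changing 
its answer. If the first call happens before workers are forked, the children 
inherit the populated cache along with the rest of the process memory.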
   
   Implementing both is purely defensive. I currently do not have enough data 
to determine which of those changes is really needed and which one will help 
reduce the problem. So I chose to implement both.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
