Jarek Potiuk created AIRFLOW-3615:
-------------------------------------

             Summary: Connection parsed from URI - case-insensitive UNIX socket 
paths in python 2.7 -> 3.5 (but not in 3.6) 
                 Key: AIRFLOW-3615
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3615
             Project: Apache Airflow
          Issue Type: Bug
            Reporter: Jarek Potiuk


There is a problem with case sensitivity when parsing URIs for database 
connections that use local UNIX sockets rather than TCP connections.

For local UNIX sockets, the hostname part of the URI contains the URL-encoded 
local socket path rather than an actual hostname. If this path contains 
uppercase characters, urlparse deliberately lowercases them when parsing. 
This is perfectly fine for hostnames (according to 
[https://tools.ietf.org/html/rfc3986#section-6.2.3], case normalisation should 
be done for hostnames).

However, urlparse still treats this part as the hostname even when the URI 
contains no host but only a local path (i.e. when the location starts with 
%2F ("/")). What's more, the hostname is converted to lowercase in Python 2.7 
through 3.5. Surprisingly, this is somewhat "fixed" in 3.6: if the URL location 
starts with %2F, the hostname is no longer normalized to lowercase (see the 
snippets below showing the behaviour for different Python versions).

This problem bubbles up in Airflow's Connection. Airflow uses urlparse to get 
the hostname/path in models.py:parse_from_uri, and for UNIX sockets the socket 
path is read from hostname. There is no other reliable way when using urlparse, 
because the location can also contain 'authority' information (user/password) 
and it is urlparse's job to separate that out. Airflow's Connection similarly 
makes no distinction between TCP and local socket connections and uses the host 
field to store the socket path (which, however, is case sensitive). So you can 
use UPPERCASE when you define a connection in the database, but parsing 
connections from environment variables is a problem: we currently cannot pass a 
connection whose socket path contains UPPERCASE characters.

Since urlparse is really meant for parsing URLs and is not good at parsing 
non-URL URIs, we should likely use a different parser which handles more 
generic URIs, including not lowercasing the path, for all versions of Python.
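
One possible stopgap, short of switching parsers, is to take the host from 
netloc rather than hostname, since netloc keeps the text of the URI unchanged. 
The sketch below is only an illustration under that assumption: 
case_preserving_host is a hypothetical helper name, it is not what Airflow 
does, and it assumes that any ":" or "@" inside the socket path is 
percent-encoded.

```python
from urllib.parse import urlparse, unquote


def case_preserving_host(uri):
    """Hypothetical helper: return the decoded host part with its case intact."""
    # netloc is the raw "user:password@host:port" text taken from the URI;
    # unlike .hostname, it is not lowercased by urlparse.
    netloc = urlparse(uri).netloc
    # Drop the user/password part, if any.
    host_port = netloc.rpartition("@")[2]
    if host_port.startswith("["):
        # Bracketed IPv6 literal such as [::1]:5432.
        host = host_port[1:].partition("]")[0]
    else:
        # Drop the port, if any (assumes no raw ":" in an encoded socket path).
        host = host_port.partition(":")[0]
    # Decode %2F and friends only after the host has been isolated.
    return unquote(host)


print(case_preserving_host("postgres://user:secret@%2Ftmp%2FSocketDir:5432/db"))
```

With the example URI above this prints /tmp/SocketDir with the uppercase 
letters preserved. Note that, unlike hostname, this helper does not lowercase 
genuine hostnames either, so a caller would have to do that separately.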

I think we could also consider adding a local path field to the Connection 
model and using it instead of hostname to store the socket path. This approach 
would be the "correct" one, but it might introduce some compatibility issues, 
so maybe it's not worth it, considering that host is case sensitive in Airflow.
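
To make the idea of a separate local path concrete, here is a rough sketch. 
ParsedTarget, its socket_path field and split_target are all hypothetical 
names invented for illustration; nothing like this exists in the Connection 
model today, and IPv6 literals are ignored for brevity.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse, unquote


@dataclass
class ParsedTarget:
    """Hypothetical container keeping a TCP host and a socket path apart."""
    host: Optional[str] = None
    socket_path: Optional[str] = None


def split_target(uri):
    # Work on netloc, which keeps the original case of the URI text.
    netloc = urlparse(uri).netloc
    # Strip user/password and port (IPv6 literals are not handled here).
    raw_host = netloc.rpartition("@")[2].partition(":")[0]
    decoded = unquote(raw_host)
    if decoded.startswith("/"):
        # A leading "/" (encoded as %2F) marks a local socket path: keep case.
        return ParsedTarget(socket_path=decoded)
    # A real hostname is case-insensitive, so normalising it is harmless.
    return ParsedTarget(host=decoded.lower() or None)


print(split_target("mysql://user@%2Fvar%2Frun%2FMySQL.sock/db"))
print(split_target("mysql://user@DB.Example.COM:3306/db"))
```

The first call yields a socket_path of /var/run/MySQL.sock, the second a host 
of db.example.com. Whether such a split is worth the compatibility cost is 
exactly the open question above.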

Snippets showing urlparse behaviour in different Python versions:
{quote}Python 2.7.10 (default, Aug 17 2018, 19:45:58)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> from urlparse import urlparse,unquote
 >>> conn = urlparse("http://AAA")
 >>> conn.hostname
 'aaa'
 >>> conn = urlparse("http://%2FAAA")
 >>> conn.hostname
 '%2faaa'
{quote}
 
{quote}Python 3.5.4 (v3.5.4:3f56838976, Aug 7 2017, 12:56:33)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> from urlparse import urlparse,unquote
 Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 ImportError: No module named 'urlparse'
 >>> from urllib.parse import urlparse,unquote
 >>> conn = urlparse("http://AAA")
 >>> conn.hostname
 'aaa'
 >>> conn = urlparse("http://%2FAAA")
 >>> conn.hostname
 '%2faaa'
{quote}
 
{quote}Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 03:02:14)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> from urllib.parse import urlparse,unquote
 >>> conn = urlparse("http://AAA")
 >>> conn.hostname
 'aaa'
 >>> conn = urlparse("http://%2FAAA")
 >>> conn.hostname
 {color:#FF0000}'%2FAAA'{color}
{quote}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
