[
https://issues.apache.org/jira/browse/AIRFLOW-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797228#comment-16797228
]
ASF GitHub Bot commented on AIRFLOW-3615:
-----------------------------------------
ashb commented on pull request #4591: [AIRFLOW-3615] Parse hostname using netloc
URL: https://github.com/apache/airflow/pull/4591
> Connection parsed from URI - case-insensitive UNIX socket paths in python 2.7
> -> 3.5 (but not in 3.6)
> ------------------------------------------------------------------------------------------------------
>
> Key: AIRFLOW-3615
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3615
> Project: Apache Airflow
> Issue Type: Bug
> Reporter: Jarek Potiuk
> Assignee: Kamil Bregula
> Priority: Major
>
> There is a problem with case sensitivity when parsing URIs for database
> connections that use local UNIX sockets rather than TCP connections.
> For local UNIX sockets, the hostname part of the URI contains the
> url-encoded local socket path rather than an actual hostname. If this
> path contains uppercase characters, urlparse deliberately lowercases them
> when parsing. This is perfectly fine for hostnames: according to
> https://tools.ietf.org/html/rfc3986#section-6.2.3, case normalisation
> should be done for hostnames.
> However, urlparse still treats this part as a hostname when the URI does not
> contain a host but only a local path (i.e. when the location starts with %2F
> ("/")). What's more, the host gets converted to lowercase in Python 2.7 - 3.5.
> Surprisingly, this is somewhat "fixed" in 3.6: if the URL location starts
> with %2F, the hostname is no longer normalized to lowercase (see the snippets
> below showing the behaviour in different Python versions).
> This problem bubbles up in Airflow's Connection. Airflow uses urlparse to get
> the hostname/path in models.py:parse_from_uri, and for UNIX sockets this is
> done via hostname. There is no other reliable way when using urlparse,
> because the path can also contain 'authority' (user/password), and it is
> urlparse's job to separate them out. Airflow's Connection similarly makes
> no distinction between TCP and local socket connections, and it uses the host
> field to store the socket path (which is case sensitive, however). So you can
> use UPPERCASE when you define a connection in the database, but this is a
> problem when parsing connections from environment variables, because we
> currently cannot pass a URI whose socket path contains UPPERCASE characters.
> Since urlparse is really meant for parsing URLs and is not good at parsing
> non-URL URIs, we should likely use a different parser that handles more
> generic URIs, including not lowercasing the path, for all versions of Python.
> I think we could also consider adding a local path to the Connection model and
> using it instead of hostname to store the socket path. This approach would be
> the "correct" one, but it might introduce some compatibility issues, so maybe
> it's not worth it, considering that host is case sensitive in Airflow.
> Snippets showing urlparse behaviour in different Python versions:
> {quote}Python 2.7.10 (default, Aug 17 2018, 19:45:58)
> [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from urlparse import urlparse,unquote
> >>> conn = urlparse("http://AAA")
> >>> conn.hostname
> 'aaa'
> >>> conn = urlparse("http://%2FAAA")
> >>> conn.hostname
> '%2faaa'
> {quote}
>
> {quote}Python 3.5.4 (v3.5.4:3f56838976, Aug 7 2017, 12:56:33)
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from urlparse import urlparse,unquote
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> ImportError: No module named 'urlparse'
> >>> from urllib.parse import urlparse,unquote
> >>> conn = urlparse("http://AAA")
> >>> conn.hostname
> 'aaa'
> >>> conn = urlparse("http://%2FAAA")
> >>> conn.hostname
> '%2faaa'
> {quote}
>
> {quote}Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 03:02:14)
> [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from urllib.parse import urlparse,unquote
> >>> conn = urlparse("http://AAA")
> >>> conn.hostname
> 'aaa'
> >>> conn = urlparse("http://%2FAAA")
> >>> conn.hostname
> '%2FAAA'
> {quote}
>
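One way to avoid the version-dependent lowercasing described in the issue is to read the host from the parse result's netloc attribute, which urlparse does not case-normalise, instead of from hostname. The following is only a minimal sketch of that idea, not the code from the pull request: host_from_netloc is a hypothetical helper name, the postgres URI is a made-up example, and the simple split on ":" assumes the host is not a bracketed IPv6 literal.

```python
from urllib.parse import urlparse, unquote


def host_from_netloc(uri):
    """Hypothetical helper: return the host part of `uri`, case preserved.

    Reads `netloc` rather than `hostname`, because `hostname` may be
    lowercased depending on the Python version.
    """
    netloc = urlparse(uri).netloc
    # Drop a leading "user:password@" section, if any.
    host_port = netloc.rsplit("@", 1)[-1]
    # Drop a trailing ":port" section, if any (assumes no IPv6 literal).
    host = host_port.split(":", 1)[0]
    # Decode %2F and friends so a socket path comes back as "/tmp/...".
    return unquote(host)


# Made-up URI whose "host" is a url-encoded UNIX socket directory.
print(host_from_netloc("postgres://user:pass@%2Ftmp%2FSocket_Dir:5432/db"))
```

Note that this sketch also leaves ordinary hostnames un-lowercased (for example "http://AAA" yields "AAA"), so a caller that wants RFC 3986 case normalisation for real hostnames would have to lowercase those separately.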