> As for the larger issue at hand: the reason req.parsed_uri is not
> filled in is because browsers don't send the info in the GET...
Disclaimer: What follows is not an exhaustive, conclusive investigation
based on tracing running code; it comes from reading the source and
watching apache's behaviour with curl, with telnet sessions to the
apache port, and with a browser.
Onward...
As mentioned already, req.parsed_uri is a tuple wrapping
request_rec.parsed_uri, which is an apr_uri_t.
The contents of this struct are touched in many places, but the
primary functions that set it are in
srclib/apr-util/uri/apr_uri.c: apr_uri_parse() and
apr_uri_parse_hostinfo(). Searching the apache source for callers, I
found a number of modules that use these functions, but they are
probably not of concern to this issue. The primary caller is
ap_parse_uri() in server/protocol.c.
ap_parse_uri() is called numerous times in server/request.c to deal
with sub-requests; it is also called in modules/http/http_request.c
for internal redirects. The main calling stack which is of concern to
this issue is:
Function called                       Defined in
---------------------------------------------------------------
ap_process_http_connection()          [modules/http/http_core.c]
 => ap_read_request()                 [server/protocol.c]
   => read_request_line()             [server/protocol.c]
     => ap_parse_uri()                [server/protocol.c]
       => apr_uri_parse()             [srclib/apr-util/uri/apr_uri.c]
ap_parse_uri() is called with a request_rec and the uri (as a string).
The string is what read_request_line() delivers, i.e., whatever the
client specified with GET during the protocol exchange. If the uri is
"full" (scheme, hostinfo, path, etc.) then the whole struct is properly
filled in; otherwise only the parts that were sent can be set (BTW,
the apr_uri_t is zeroed out with memset in apr_uri_parse()).
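
To illustrate why a bare path leaves most of the fields empty, here is a rough analogy using Python's standard urllib.parse rather than apr_uri_parse() itself. The two parsers are not identical, so treat this only as a sketch. The helper name split_like_apr and the example URI are made up for the illustration; the field order is the one req.parsed_uri appears to use.

```python
from urllib.parse import urlsplit


def split_like_apr(uri):
    """Return a 9-tuple (scheme, hostinfo, user, password, hostname,
    port, path, query, fragment), with None for anything not present.

    Analogy only: this uses urllib.parse, not apr_uri_parse().
    """
    p = urlsplit(uri)
    return (
        p.scheme or None,
        p.netloc or None,
        p.username,
        p.password,
        p.hostname,
        p.port,
        p.path or None,
        p.query or None,
        p.fragment or None,
    )


# A "full" URI fills in every field.
full = split_like_apr("http://user:secret@example.com:8000/path?a=b#frag")

# A bare path, as typically sent on the request line, fills in only
# path and query; the parser has nothing else to work with.
bare = split_like_apr("/path?a=b")
```

Note that urlsplit leaves the port unset for a bare path; whatever apache does beyond the plain parse is not modelled here.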
Observations
============
I wrote a handler to return, as text/plain, the setting of various req
members of interest to this discussion. I set up apache to run on a
non-default port and required basic auth to access the page so the
full uri will be parsed (theoretically).
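
For reference, a handler along these lines could look like the following. This is a minimal sketch, not the exact code used for the tests; format_report is a hypothetical helper name, and the mod_python import is placed inside handler() only so the module can also be loaded outside apache.

```python
def format_report(req):
    """Render the request attributes of interest as plain text."""
    names = ("hostname", "unparsed_uri", "parsed_uri", "uri", "args")
    lines = ["req.%s: %s" % (name, getattr(req, name, None)) for name in names]
    return "\n".join(lines) + "\n"


def handler(req):
    # Imported here so the module still loads where mod_python is absent.
    from mod_python import apache

    req.content_type = "text/plain"
    req.write(format_report(req))
    return apache.OK
```

Keeping the formatting in its own function means it can be exercised with any object that has the same attribute names.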
When I type the following into my browser (firefox):
http://foo:bar@localhost:8000/~dpopowich/py/parsed?a=b&c=d#here
Here's the output:
req.hostname: localhost
req.unparsed_uri: /~dpopowich/py/parsed?a=b&c=d
req.parsed_uri: (None, None, None, None, None, 8000,
'/~dpopowich/py/parsed', 'a=b&c=d', None)
req.uri: /~dpopowich/py/parsed
req.args: a=b&c=d
It appears only "/PATH?QUERY" has been passed to the server and I
confirmed this by sniffing the packets. It's interesting that the
port is set and hostname is not...I think this has to do with some
code in the virtual host handling.
Here's the output with a verbose call with curl (same uri as above):
* About to connect() to localhost port 8000
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 8000
* Server auth using Basic with user 'foo'
> GET /~dpopowich/py/parsed?a=b&c=d#here HTTP/1.1
> Authorization: Basic Zm9vOmJhcg==
> User-Agent: curl/7.15.0 (i486-pc-linux-gnu) libcurl/7.15.0 OpenSSL/0.9.8a zlib/1.2.3 libidn/0.5.18
> Host: localhost:8000
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 30 Nov 2005 15:43:19 GMT
< Server: Apache/2.0.54 (Debian GNU/Linux) mod_python/3.2.5b Python/2.3.5 mod_perl/2.0.1 Perl/v5.8.7
< Connection: close
< Transfer-Encoding: chunked
< Content-Type: text/plain
req.hostname: localhost
req.unparsed_uri: /~dpopowich/py/parsed?a=b&c=d#here
req.parsed_uri: (None, None, None, None, None, 8000,
'/~dpopowich/py/parsed', 'a=b&c=d', 'here')
req.uri: /~dpopowich/py/parsed
req.args: a=b&c=d
* Closing connection #0
Notice how "/PATH?QUERY#FRAGMENT" is passed with this client.
Now if I type the following into a telnet session (telnet localhost 8000):
GET http://foo:bar@localhost:8000/~dpopowich/py/parsed?a=b&c=d#here HTTP/1.1
Authorization: Basic Zm9vOmJhcg==
Host: localhost:8000
Then the output is:
req.hostname: localhost
req.unparsed_uri: http://foo:bar@localhost:8000/~dpopowich/py/parsed?a=b&c=d#here
req.parsed_uri: ('http', 'foo:bar@localhost:8000', 'foo', 'bar',
'localhost', 8000, '/~dpopowich/py/parsed', 'a=b&c=d', 'here')
req.uri: /~dpopowich/py/parsed
req.args: a=b&c=d
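
The nine positions of the tuple are easier to compare with names attached. A small sketch follows; the field names are my own labels inferred from the output of the three tests (mod_python does not hand back a named tuple), and is_full_uri is a hypothetical helper.

```python
from collections import namedtuple

# Labels inferred from the observed output; not a mod_python type.
ParsedURI = namedtuple(
    "ParsedURI",
    "scheme hostinfo user password hostname port path query fragment",
)

# What the browser test produced: only port, path and query are set.
browser_case = ParsedURI(None, None, None, None, None, 8000,
                         "/~dpopowich/py/parsed", "a=b&c=d", None)

# What the telnet test with a full URI produced: everything is set.
telnet_case = ParsedURI("http", "foo:bar@localhost:8000", "foo", "bar",
                        "localhost", 8000,
                        "/~dpopowich/py/parsed", "a=b&c=d", "here")


def is_full_uri(parsed):
    """True when the request line carried a scheme and a hostname."""
    return parsed.scheme is not None and parsed.hostname is not None
```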
Summary
=======
o req.hostname is set from the full URI or, in the absence of a full
URI, from the value of the Host header (this is what the mod_python
docs actually say). As mentioned before, when the client sends
neither a Host header (as HTTP/1.1 requires) nor a full URI,
req.hostname can be None.
o req.unparsed_uri is set to the uri specified with GET
o req.parsed_uri is the parsing of req.unparsed_uri (although the
port may be filled in even when req.unparsed_uri does not contain
one, provided it's not 80). There are definitely inconsistencies in
how apache handles this struct. A bug? Maybe not, but some cleanup of
the code with regards to this struct would be nice.
o req.uri is set to req.parsed_uri.path
o req.args is set to req.parsed_uri.query
o When a full URI is specified with GET, the values of hostname and
port can be bogus, i.e., the values in parsed_uri will be set to
whatever the uri specifies, but this may not be the host or port
the client actually connected to. While not explicitly a security
risk, poor programming based on these values could lead to one,
IMHO.
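
As an illustration of that last point, an app that does look at the host and port claimed in the request line could at least compare them against values it already knows to be its own. This is only a sketch under assumptions: TRUSTED_HOSTS and claimed_host_is_trusted are hypothetical names, and the indexes 4 and 5 are the hostname and port positions seen in the output above.

```python
# Hypothetical set of (hostname, port) pairs this app really serves.
TRUSTED_HOSTS = {("localhost", 8000)}


def claimed_host_is_trusted(parsed_uri, trusted=TRUSTED_HOSTS):
    """Check the host/port claimed in a full request URI.

    parsed_uri is a 9-tuple laid out like req.parsed_uri. When no
    hostname was claimed (the usual browser case) there is nothing
    to reject, so the check passes.
    """
    hostname, port = parsed_uri[4], parsed_uri[5]
    if hostname is None:
        return True
    return (hostname.lower(), port) in trusted
```

A request line naming some other host would then be refused or redirected instead of having its values echoed back into generated links.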
Therefore, I think we're stuck. There's no way we can guarantee
browsers will pass full URIs and none seem to do so. I agree with
Grisha:
o get interfaces to apache functions that return the actual
connection attributes.
Also:
o since you can't rely on any of the hostinfo specified with GET
being valid, apps should rely on hard-coded values in
configuration files to build full URIs. E.g., if you know your app
is rooted at http://somehost:someport/, put that as a string in a
configuration module that can be imported, then append your path
and query to it. Forcing redirects to the "proper" host in your
apache configuration is probably good practice as well.
o if you're using virtual hosts and your app is not running in the
default virtual host, then (I believe) you're forcing clients to
be speaking HTTP/1.1 in which case req.hostname is guaranteed to
be set, right? You might be able to build strings off of that,
but then your app is dependent on the vagaries of your apache
configuration and one miscalculated cut&paste, placing your
virtualhost first, may lead to weirdness.
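
The configuration-module suggestion above might look something like this. It is a minimal sketch: BASE_URL, its value, and build_url are all made up for the example, and in practice the constant would live in its own importable module.

```python
from urllib.parse import urlencode

# Hypothetical hard-coded root of the app; never derived from the request.
BASE_URL = "http://somehost:8000/"


def build_url(path, query=None, base=BASE_URL):
    """Build a full URL from the configured root plus a path and query.

    path is expected to be already URL-safe (e.g. taken from req.uri);
    query is an optional dict of parameters to encode.
    """
    url = base.rstrip("/") + "/" + path.lstrip("/")
    if query:
        url += "?" + urlencode(query)
    return url
```

For example, build_url("/~dpopowich/py/parsed", {"a": "b", "c": "d"}) gives a full URL under the configured root no matter what the client put on its request line.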
OK...long enough...ttfn,
Daniel Popowich
---------------
http://home.comcast.net/~d.popowich/mpservlets/