> As for the larger issue at hand: the reason req.parsed_uri is not
> filled in is because browsers don't send the info in the GET...

Disclaimer: What follows is not the result of an exhaustive trace of
running code.  It comes from reading the source and watching apache's
behaviour with curl, a telnet session to the apache port, and a
browser.

Onward...

As mentioned already, req.parsed_uri is a tuple wrapping of a
request_rec.parsed_uri which is an apr_uri_t.

The contents of this struct are touched in many places, but the
primary functions setting it are apr_uri_parse() and
apr_uri_parse_hostinfo() in srclib/apr-util/uri/apr_uri.c.  Searching
the apache source for callers, I found a number of modules using these
functions, but they are probably not of concern to this issue.  The
primary caller is ap_parse_uri() in server/protocol.c.

ap_parse_uri() is called numerous times in server/request.c to deal
with sub-requests; it is also called in modules/http/http_request.c
for internal redirects.  The main calling stack which is of concern to
this issue is:

Function Called                 Function defined in
---------------------------------------------------------------
ap_process_http_connection()    [modules/http/http_core.c]
=> ap_read_request()            [server/protocol.c]
   => read_request_line()       [server/protocol.c]
      => ap_parse_uri()         [server/protocol.c]
         => apr_uri_parse()     [srclib/apr-util/uri/apr_uri.c]


ap_parse_uri() is called with a request_rec and the uri as a string.
The string is what read_request_line() delivers, which is whatever the
client specified with GET during the protocol exchange.  If the uri is
"full" then the whole struct is properly filled in (BTW, the apr_uri_t
is zeroed out with memset in apr_uri_parse()).
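
To make the difference between a "full" uri and a path-only uri
concrete, here is a rough Python sketch.  It is not apache's
apr_uri_parse(); it uses the standard library's urlsplit() and only
mimics the 9-field layout of req.parsed_uri.  The function name and the
example URIs are made up for illustration.  Note that this sketch
leaves the port as None for a path-only uri, whereas apache fills it in
from elsewhere.

```python
from urllib.parse import urlsplit


def parse_request_target(target):
    """Split a request target into a 9-tuple laid out like req.parsed_uri.

    Order: scheme, hostinfo, user, password, hostname, port,
    path, query, fragment.  Missing parts are None.
    This is an illustration only, not what apr_uri_parse() does internally.
    """
    parts = urlsplit(target)
    return (
        parts.scheme or None,
        parts.netloc or None,   # hostinfo, e.g. "user:secret@host:port"
        parts.username,
        parts.password,
        parts.hostname,
        parts.port,
        parts.path or None,
        parts.query or None,
        parts.fragment or None,
    )


# Path-only form (what browsers send) versus a full URI (hypothetical values).
print(parse_request_target("/app/page?a=b&c=d#here"))
print(parse_request_target("http://user:secret@example.com:8000/app/page?a=b&c=d#here"))
```

With the path-only form, every host-related field comes back None; only
the full form populates scheme, hostinfo, user, password and hostname.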

Observations
============

I wrote a handler to return, as text/plain, the setting of various req
members of interest to this discussion.  I set up apache to run on a
non-default port and required basic auth to access the page so the
full uri will be parsed (theoretically).
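
A handler along these lines is enough to reproduce the output.  This
is a minimal sketch, not the exact handler I ran: the helper name
format_request_fields is invented here, and the mod_python import is
kept inside handler() so the formatting helper can be tried outside
apache.

```python
def format_request_fields(req):
    """Return the req members of interest, one 'req.NAME: VALUE' per line."""
    names = ("hostname", "unparsed_uri", "parsed_uri", "uri", "args")
    lines = ["req.%s: %s" % (name, getattr(req, name, None)) for name in names]
    return "\n".join(lines) + "\n"


def handler(req):
    """mod_python content handler: dump the fields as text/plain."""
    # Imported here so the module can be loaded without mod_python installed.
    from mod_python import apache

    req.content_type = "text/plain"
    req.write(format_request_fields(req))
    return apache.OK
```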

When I type the following into my browser (firefox):

   http://foo:[EMAIL PROTECTED]:8000/~dpopowich/py/parsed?a=b&c=d#here

Here's the output:

    req.hostname: localhost
    req.unparsed_uri: /~dpopowich/py/parsed?a=b&c=d
    req.parsed_uri: (None, None, None, None, None, 8000, '/~dpopowich/py/parsed', 'a=b&c=d', None)
    req.uri: /~dpopowich/py/parsed
    req.args: a=b&c=d


It appears only "/PATH?QUERY" has been passed to the server and I
confirmed this by sniffing the packets.  It's interesting that the
port is set and hostname is not...I think this has to do with some
code in the virtual host handling.

Here's the output with a verbose call with curl (same uri as above):

    * About to connect() to localhost port 8000
    *   Trying 127.0.0.1... connected
    * Connected to localhost (127.0.0.1) port 8000
    * Server auth using Basic with user 'foo'
    > GET /~dpopowich/py/parsed?a=b&c=d#here HTTP/1.1
    > Authorization: Basic Zm9vOmJhcg==
    > User-Agent: curl/7.15.0 (i486-pc-linux-gnu) libcurl/7.15.0 OpenSSL/0.9.8a zlib/1.2.3 libidn/0.5.18
    > Host: localhost:8000
    > Accept: */*
    >
    < HTTP/1.1 200 OK
    < Date: Wed, 30 Nov 2005 15:43:19 GMT
    < Server: Apache/2.0.54 (Debian GNU/Linux) mod_python/3.2.5b Python/2.3.5 mod_perl/2.0.1 Perl/v5.8.7
    < Connection: close
    < Transfer-Encoding: chunked
    < Content-Type: text/plain
    req.hostname: localhost
    req.unparsed_uri: /~dpopowich/py/parsed?a=b&c=d#here
    req.parsed_uri: (None, None, None, None, None, 8000, '/~dpopowich/py/parsed', 'a=b&c=d', 'here')
    req.uri: /~dpopowich/py/parsed
    req.args: a=b&c=d

    * Closing connection #0

Notice how "/PATH?QUERY#FRAGMENT" is passed with this client.

Now if I type the following into a telnet session (telnet localhost 8000):

    GET http://foo:[EMAIL PROTECTED]:8000/~dpopowich/py/parsed?a=b&c=d#here HTTP/1.1
    Authorization: Basic Zm9vOmJhcg==
    Host: localhost:8000

Then the output is:

    req.hostname: localhost
    req.unparsed_uri: http://foo:[EMAIL PROTECTED]:8000/~dpopowich/py/parsed?a=b&c=d#here
    req.parsed_uri: ('http', 'foo:[EMAIL PROTECTED]:8000', 'foo', 'bar', 'localhost', 8000, '/~dpopowich/py/parsed', 'a=b&c=d', 'here')
    req.uri: /~dpopowich/py/parsed
    req.args: a=b&c=d


Summary
=======

 o req.hostname is set from the contents of the full URI or, in the
   absence of a full uri, from the value of the Host header (this is
   what is actually said in the mod_python docs).  As mentioned
   before, when neither HTTP/1.1 nor a full URI is used, req.hostname
   can be None.

 o req.unparsed_uri is set to the uri specified with GET

 o req.parsed_uri is the parsing of req.unparsed_uri (although the
   port may be filled in even if it's not in req.unparsed_uri, at
   least when it's not 80).  There are definitely inconsistencies in
   how apache handles this struct.  A bug?  Maybe not, but some
   cleanup of the code with regards to this struct would be nice.

 o req.uri is set to req.parsed_uri.path

 o req.args is set to req.parsed_uri.query

 o When a full URI is specified with GET, the values of hostname and
   port can be bogus, i.e., the values in parsed_uri will be set to
   whatever the uri specifies, but this may not be the host or port
   the client actually connected to.  While not explicitly a security
   risk, poor programming based on these values could lead to one,
   IMHO.
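
To illustrate that last point, here is a sketch of the kind of check an
app could make before trusting the host and port in req.parsed_uri.
The index constants simply follow the tuple order seen in the output
above; the function name and the EXPECTED set are hypothetical, and
this is only one possible defensive approach.

```python
# Positions of hostname and port in the 9-field parsed_uri tuple,
# matching the order shown in the handler output above.
URI_HOSTNAME = 4
URI_PORT = 5

# Hypothetical list of (hostname, port) pairs this app is really served on.
EXPECTED = {("localhost", 8000)}


def claimed_host_is_expected(parsed_uri, expected=EXPECTED):
    """Return False if the request line names a host/port we do not serve.

    A path-only request line claims no host at all (hostname is None),
    so there is nothing to contradict and the check passes.
    """
    hostname = parsed_uri[URI_HOSTNAME]
    port = parsed_uri[URI_PORT]
    if hostname is None:
        return True
    return (hostname.lower(), port) in expected
```

A client that types "GET http://some.other.host:1234/... HTTP/1.1" into
a telnet session would fail this check even though it is connected to
localhost:8000.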


Therefore, I think we're stuck.  There's no way we can guarantee
browsers will pass full URIs and none seem to do so.  I agree with
Grisha:

   o get interfaces to apache functions that return the actual
     connection attributes.

Also:

   o since you can't rely on any of the hostinfo specified with GET
     being valid, apps should rely on hard-coded values in
     configuration files to build full URIs.  E.g., if you know your
     app is rooted at http://somehost:someport/, put that as a string
     in a configuration module that can be imported, then append your
     PATH?QUERY to it.  Forcing redirects to the "proper" host in your
     apache configurations is probably good practice as well.

   o if you're using virtual hosts and your app is not running in the
     default virtual host, then (I believe) you're forcing clients to
     be speaking HTTP/1.1 in which case req.hostname is guaranteed to
     be set, right?  You might be able to build strings off of that,
     but then your app is dependent on the vagaries of your apache
     configuration and one miscalculated cut&paste, placing your
     virtualhost first, may lead to weirdness.
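
As a sketch of the hard-coded approach, something like the following
could live in an importable configuration module.  BASE_URL and
build_url are made-up names, and the host and port are placeholders
you would replace with your own.

```python
from urllib.parse import urlencode

# Hypothetical root of the app; set this by hand, never from the request.
BASE_URL = "http://somehost:8000/"


def build_url(path, query=None):
    """Build a full URI from the configured root plus a path and query.

    `query` may be a dict or a sequence of (name, value) pairs.
    """
    url = BASE_URL.rstrip("/") + "/" + path.lstrip("/")
    if query:
        url += "?" + urlencode(query)
    return url


print(build_url("/app/page", [("a", "b"), ("c", "d")]))
```

Handlers then call build_url(req.uri) when they need an absolute URI
for a redirect, instead of piecing one together from req.parsed_uri.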


OK...long enough...ttfn,

Daniel Popowich
---------------
http://home.comcast.net/~d.popowich/mpservlets/

