Re: http authentication improvement

John-Mark Bell Sun, 02 Aug 2009 19:17:14 -0700

On Mon, 2009-08-03 at 00:32 +0100, John Tytgat wrote:
> I'm trying to get
> <URL:http://sourceforge.net/tracker/?func=detail&aid=2830829&group_id=51719&atid=464312>
> fixed and I'm seeking some feedback on what I have.


Thanks for taking a look at this.

> What happens: a https://example.com/foo/bar/doh/leaf.html gets fetched
> after the user supplied http authentication.  That pages refers to several
> stylesheets which are all located under https://example.com/ but not in
> the foo/bar/doh directory.  Currently all those fetches for those
> stylesheets are resulting in a 401 as in
> fetch_curl.c(fetch_curl_set_options) the urldb_get_auth_details() does not
> return authentication which the user supplied for leaf.html.  The 401
> results in the FETCH_AUTH case in fetchcache.c(fetchcache_callback) and
> that's the end for the stylesheet parsing as we'll tumble into
> html_convert_css_callback()'s case CONTENT_MSG_AUTH.

Yeah. It's somewhat miraculous that the current auth implementation
actually works.

> Aside, what's the CONTENT_MSG_AUTH case in css.c(nscss_import) ?

Same thing, but for stylesheets retrieved through @import. There's a
bunch more instances of similar switch statements sprinkled through the
sources.

> The urldb_get_auth_details() fails because urldb_set_auth_details() gets
> done for the path_data of https://example.com/foo/bar/doh/ (the leafname
> of the URL gets stripped) so all auth retrieving fails for shorter paths.

Yup. This is wrong.

> I've been reading http://tools.ietf.org/html/rfc2617 and this learns me
> that the http authentication details (username & password) are associated
> with a protection space and the latter is uniquely defined by a canonical
> root URL and a realm given in a 401 response for a certain URL belonging
> to that space. All paths below that URL belong to the same protection
> space.
> 
> It could, but does not need so, that the paths higher than that URL do
> also belong to the same protection space but that's something you can not
> know until you try to fetch them without authentication and get a 401 with
> the same realm (assuming same canonical root here).

A protection space is defined as the tuple:

{ canonical_root_url, realm }

Thus, all URLs which prefix match the canonical_root_url may be in a
given protection space.

The canonical root URL is defined as being an absoluteURI with an empty
abs_path. It's not entirely clear to me what the impact of query
components is. My inference is that they don't appear in canonical root
URLs.

So, given an URL:

scheme://authority/foo/bar/baz/

The canonical root would be:

scheme://authority

where authority contains the host/IP address and port number.

Determining if a given URL is a member of a protection space requires
realm data. This isn't available before a fetch. Therefore, until you
receive a 401 response, you can't reliably send authentication details.

However, you could speculatively send authentication details for an
initial fetch if you have some for the canonical root URL. If the
details aren't acceptable, then you'll get a 401 response as normal.

> I've changed the urldb data structure a bit to better reflect the
> protection space concept (see patch below) : in the struct host_part we
> now have a linked list of struct prot_space_data structures, i.e. one per
> protection space we've seen for that host (apart from its scheme and port
> because that info is found in each path_data).  That structure also
> contains the identifying realm and associated authentication data.  Each
> struct path_data can have a pointer to one of prot_space_data structures
> and this only when we know for sure it belongs to that pointed protection
> space.

I think this is probably too complex. I can't immediately remember what
happens within urldb when there's multiple port numbers involved (or,
indeed, where the path data for empty path is stored).

ISTR there's something nasty like

               host
                |
    +-----------+-----------+
    |           |           |
{"",port1} {"",port2} {"foo",port1}
                            |
                       {"",port1}

I.E. The path data for scheme://host:port1/foo/ is _not_ a subtree of
the path data for scheme://host:port1/. You can envisage the
parent-child link between tree nodes as modelling a '/' in the URL's
path component.

The obvious place for authentication data is on the path_data node for
the canonical root URL. This cannot be a host_data node, as the
canonical root URL includes scheme and port details. 

Given the somewhat unhelpful path_data tree structure, searching for
authentication details for a URL is somewhat awkward (though no more
awkward than searching for any other parent path related data in urldb).

That said, by storing the authentication data on the path_data node for
the canonical root URL, you reduce the fragility of the data structure.

Given the above, urldb_set_auth_details would do something like the
following:

urldb_set_auth_details(url, realm, details):

  # Find canonical root
  canonical-root := canonical_root(url);

  # Find path data for canonical root
  path-data := find_path_data(canonical-root);

  # Find realm data on path, if any
  auth-tuple := find_auth_data(path-data, realm);

  # Create realm data on path, if none
  if (!auth-tuple):
    auth-tuple = new_auth_tuple_for_path(path-data, realm);

  # Set authentication details.
  auth-tuple.details = details;


Then urldb_get_auth_details would do something like:

urldb_get_auth_details(url, realm):

  # Find canonical root
  canonical-root := canonical_root(url);

  # Find path data for canonical root
  path-data := find_path_data(canonical-root);

  # Find realm data on path, if any
  if (!realm):
    # Speculate!
    auth-tuple := path-data.first_auth;
  else:
    auth-tuple := find_auth_data(path-data, realm);

  return auth-tuple;

That is, if the client doesn't provide a realm, it speculatively gives
you the realm data for the first available realm, if any.

> In that case I believe we should let urldb_get_auth_details() fail even
> when the host_part has one or more prot_space_data structures associated
> with and await a 401 which gives us the right realm so that we can make
> the right choice for selecting prot_space_data or if it is a new realm,
> offer the user the oppertunity to enter an username and password.  Note
> it might very well be that we don't get a 401 at all as such an URL might
> not need authentication.

I think it's worth speculatively sending authentication details if we
have some -- not least, it saves a request/response cycle if the
speculation is correct (which it will be in the common case of a single
protection space per canonical root).

> I've tried to implement that but failed miserable, basically I wanted to
> do a browser_window_reload() or something similar in the CONTENT_MSG_AUTH
> case of browser.c(browser_window_callback) when
> urldb_get_auth_details(c->url, data.auth_realm) returns something non-NULL
> (btw, which would at that point have filled in the prot_space_data of
> lowest path_data so that at fetch_curl_set_options() time,
> urldb_get_auth_details(f->url, NULL) returns the same data even with
> realm param NULL).

I think this is the wrong place to be handling this kind of thing. It
should be treated in the same way that redirects are. That is,
transparent to the fetchcache client unless some action is required.

So, for authentication (and assuming that urldb's auth data code is
changed), we'd want fetchcache to do something like:

if (http_code == 401):
   # Look for auth data for this protection space
   auth-data := urldb_get_auth_data(fetch_url, realm_from_http_header);

   if (auth-data && !fetch.tried_with_auth):
     # We've got some auth data and we've not yet tried fetching with it
     # Flag that we've tried fetching with auth data
     fetch.tried_with_auth := true;
     # Retry fetch, with auth data
     retry_fetch(auth-data);
   else:
     # No authentication data, or we've already tried using what we had.
     # Clear flag, so we don't end up here when the client retries
     fetch.tried_with_auth := false;
     # Ask client to sort out the mess.
     fetchcache_callback(CONTENT_MSG_AUTH, ...);

That is, the client of fetchcache is only ever told about 401 stuff when
no suitable authentication data is available. In this case, the user
needs to be asked for it.

> But even if that would somehow work, it wouldn't for loaded stylesheets
> as something similar should happen in the CONTENT_MSG_AUTH case in
> html_convert_css_callback(), right ? I see other CONTENT_MSG_AUTH cases
> (like css.c(nscss_import) so probably even more needs to be done.

Correct. They all need to bubble the message up to their parent
contents. Once it's reached the top-level content, the user will be
asked for details. 

I suspect, however, that, for embedded content, there's then no way to
actually restart the fetch with the authentication details provided.
This is problematic (and most likely why the existing code does
precisely nothing for authentication requests for embedded objects).

> So I went for the quick hack in urldb_get_auth_details() which picks the
> first prot_space_data for given host if available for the case where non
> of the path_data structs of given URL has a ptr to prot_space_data.  It
> works for my sf.net case mentioned above but it will result in
> authentication data given for URLs which do not require them, or can
> give wrong authentication (because of wrongly chosen prot_space_data).

Indeed.

> Now the questions:
> 
> 1. Is this hack acceptable for the time being ?
> 2. Or should we shoot for, what I think is, the right approach, i.e.
>    when no prot_space_data ptr can be found in path_date structs, await
>    a 401 and with the realm given at that point take the right
>    prot_space_data or offer the uses the possibility to enter a new
>    username/password for that realm.  But that needs to work for all
>    content fetched, not only html (what I think is happening).
>    TBH, this is probably above my head to implement for the time I can
>    spend on.

I'd prefer (2), which is why there's been a couple of related
authentication bugs sat in the tracker since February.

Does the design I outlined above make sense to you?

> Aside: path_data::port seems to be 0 for URLs where the port is not
> specified, is that intended ?

Yes. A port of 0 implies the default port for the scheme. So port 80 for
http, and 443 for https. The reason for using 0 in this way is that,
otherwise, you have no ability to distinguish the case where the port
was implicit or explicitly defined. This matters for providing the user
with a consistent view of URLs they navigate to.

> Aside: the riscos/401login.c change allows to keep the username and
> password in the 401login dialogue after a 2nd 401 (for the same realm
> and canonical root).  If this is acceptable, other front-end can do
> a similar thing.

That sounds like a nice thing to have. That said, given my suggested
approach, the only time the user will be asked is if authentication has
already failed completely. In that case, it may not be useful to
pre-populate the dialogue fields.


J.

Re: http authentication improvement

Reply via email to