Re: [PATCH] Make caching hash more deterministic

2005-08-15 Thread Colm MacCarthaigh
On Sat, Aug 13, 2005 at 10:29:54AM +0200, Graham Leggett wrote:
 The idea of canonicalising the name is sound, but munging them into an 
 added :80 and an added ? is really ugly - these are not the kind of URLs 
 that an end user would understand at a glance if they had to see them 
 listed.

An end-user should never see these keys, the only place they are visible
to any user is the semi-binary mod_disk_cache files. An administrator
would have to really know what they're doing to find them, or be using
htcacheadmin - once I finish that, and if it gets accepted. 

 Is it possible to remove the :80 if the scheme is https, and remove the 
 :443 if the scheme is https:? What is the significance of the added ??

The ? isn't me, that's current mod_cache behaviour, so I left it
alone.

It doesn't have any significance except for avoiding an extra condition.
r-args is part of the key aswell, it just happens to have been NULL in
those examples.

Either way, doing as you suggest is trivial, but is there really a point
adding more conditions? Any tool which does inspect the cache files can
clean it up for presentation to the administrator.

-- 
Colm MacCárthaighPublic Key: [EMAIL PROTECTED]


Re: [PATCH] Make caching hash more deterministic

2005-08-15 Thread Jim Jagielski


On Aug 15, 2005, at 4:10 AM, Colm MacCarthaigh wrote:


On Sat, Aug 13, 2005 at 10:29:54AM +0200, Graham Leggett wrote:

The idea of canonicalising the name is sound, but munging them  
into an
added :80 and an added ? is really ugly - these are not the kind  
of URLs

that an end user would understand at a glance if they had to see them
listed.



An end-user should never see these keys, the only place they are  
visible

to any user is the semi-binary mod_disk_cache files. An administrator
would have to really know what they're doing to find them, or be using
htcacheadmin - once I finish that, and if it gets accepted.


Is it possible to remove the :80 if the scheme is https, and  
remove the
:443 if the scheme is https:? What is the significance of the  
added ??




The ? isn't me, that's current mod_cache behaviour, so I left it
alone.

It doesn't have any significance except for avoiding an extra  
condition.
r-args is part of the key aswell, it just happens to have been  
NULL in

those examples.

Either way, doing as you suggest is trivial, but is there really a  
point
adding more conditions? Any tool which does inspect the cache files  
can

clean it up for presentation to the administrator.



I think Colm has a valid point... since these are internal  
representations

then the cleaning up would best be done by the actual view process.
I would imagine that keeping the internal representations consistent
would streamline the actual functional aspects.


Re: [PATCH] Make caching hash more deterministic

2005-08-15 Thread Graham Leggett

Colm MacCarthaigh wrote:


An end-user should never see these keys, the only place they are visible
to any user is the semi-binary mod_disk_cache files. An administrator
would have to really know what they're doing to find them, or be using
htcacheadmin - once I finish that, and if it gets accepted. 


Ok cool - I was not sure whether the URL list would be mined for data 
for any reason.



Either way, doing as you suggest is trivial, but is there really a point
adding more conditions? Any tool which does inspect the cache files can
clean it up for presentation to the administrator.


In that case you're right - the less work done on the URL, the faster it 
will be.


Regards,
Graham
--


Re: [PATCH] Make caching hash more deterministic

2005-08-14 Thread Graham Leggett

Colm MacCarthaigh wrote:


Currently;

GET / HTTP/1.1
Host: ftp.heanet.ie

GET http://ftp.heanet.ie/ HTTP/1.0

GET HTTP://Ftp.Heanet.Ie/ HTTP/1.0

are all mapped to different hashes by mod_cache; despite being the same
content, this is an inefficient waste of disk space and really awkward
for me trying to write a debug/admin tool.

The attached patch makes it deterministic, by mapping them all to;

	http://ftp.heanet.ie:80/?; 


The idea of canonicalising the name is sound, but munging them into an 
added :80 and an added ? is really ugly - these are not the kind of URLs 
that an end user would understand at a glance if they had to see them 
listed.


Is it possible to remove the :80 if the scheme is https, and remove the 
:443 if the scheme is https:? What is the significance of the added ??


Regards,
Graham
--




Re: [PATCH] Make caching hash more deterministic

2005-08-13 Thread Colm MacCarthaigh
On Fri, Aug 12, 2005 at 01:34:50PM -0400, Jim Jagielski wrote:
 Here's a more involved patch that gets the logic right. It's 6pm on a
 Friday for me, so I have only tested it a little, but thought I'd  
 share
 for comment before the weekend.
 
 
 +1 on inspection... testing to be done over
 the weekend :)

Of course :) I've run http local and proxy cases, and ftp proxy cases,
as well as a few odd things now. With UseCanonicalName on, it does
improve the hitrates. 

I've changed the patch a little (attached) but only some cosmetic
comment changes, and I ditched the local:// uri;

Thinking about it, for a cache to be shared amongst protocols, things
like the connection port would have to be faked anyway. So might aswell
include the real serving protocol - makes much more sense to
administrators.

-- 
Colm MacCárthaighPublic Key: [EMAIL PROTECTED]
Index: modules/cache/cache_storage.c
===
--- modules/cache/cache_storage.c   (revision 232437)
+++ modules/cache/cache_storage.c   (working copy)
@@ -320,12 +320,82 @@
 apr_status_t cache_generate_key_default(request_rec *r, apr_pool_t* p,
 char**key)
 {
-if (r-hostname) {
-*key = apr_pstrcat(p, r-hostname, r-uri, ?, r-args, NULL);
+char *port_str, *scheme, *hn;
+const char * hostname;
+int i;
+
+/* Use the canonical name to improve cache hit rate, but only if this is
+ * not a proxy request. 
+ */ 
+if (!r-proxyreq) {
+/* Use _default_ as the hostname if none present, as in mod_vhost */
+hostname =  ap_get_server_name(r);
+if (!hostname) {
+hostname = _default_;
+}
 }
+else if(r-parsed_uri.hostname) {
+/* Copy the parsed uri hostname */
+hn = apr_pcalloc(p, strlen(r-parsed_uri.hostname) + 1);
+for (i = 0; r-parsed_uri.hostname[i]; i++) {
+hn[i] = apr_tolower(r-parsed_uri.hostname[i]);
+}
+/* const work-around */
+hostname = hn;
+}
 else {
-*key = apr_pstrcat(p, r-uri, ?, r-args, NULL);
+/* We are a proxied request, with no hostname. Unlikely
+ * to get very far - but just in case */
+hostname = _default_;
 }
+  
+/* Copy the scheme, ensuring that it is lower case. If the parsed uri
+ * contains no string or if this is not a proxy request.
+ */
+if (r-proxyreq  r-parsed_uri.scheme) {
+/* Copy the scheme */
+scheme = apr_pcalloc(p, strlen(r-parsed_uri.scheme) + 1);
+for (i = 0; r-parsed_uri.scheme[i]; i++) {
+scheme[i] = apr_tolower(r-parsed_uri.scheme[i]);
+}
+}
+else {
+scheme = http;
+}
+
+/* If the content is locally generated, use the port-number of the
+ * current server. Otherwise. copy the URI's port-string (which may be a
+ * service name). If the URI contains no port-string, use apr-util's
+ * notion of the default port for that scheme - if available.
+ */
+if(r-proxyreq) {
+if (r-parsed_uri.port_str) {
+port_str = apr_pcalloc(p, strlen(r-parsed_uri.port_str) + 2);
+port_str[0] = ':';
+for (i = 0; r-parsed_uri.port_str[i]; i++) {
+port_str[i + 1] = apr_tolower(r-parsed_uri.port_str[i]);
+}
+}
+else if (apr_uri_port_of_scheme(scheme)) {
+port_str = apr_psprintf(p, :%u, apr_uri_port_of_scheme(scheme));
+}
+else {
+/* No port string given in the AbsoluteUri, and we have no
+ * idea what the default port for the scheme is. Leave it
+ * blank and live with the inefficiency of some extra cached
+ * entities.
+ */
+port_str = ;
+}   
+}   
+else {
+/* Use the server port */
+port_str = apr_psprintf(p, :%u, ap_get_server_port(r));
+}
+
+/* Key format is a URI */
+*key = apr_pstrcat(p, scheme, ://, hostname, port_str,
+   r-parsed_uri.path, ?, r-args, NULL);
+
 return APR_SUCCESS;
 }
-


[PATCH] Make caching hash more deterministic

2005-08-12 Thread Colm MacCarthaigh

Currently;

GET / HTTP/1.1
Host: ftp.heanet.ie

GET http://ftp.heanet.ie/ HTTP/1.0

GET HTTP://Ftp.Heanet.Ie/ HTTP/1.0

are all mapped to different hashes by mod_cache; despite being the same
content, this is an inefficient waste of disk space and really awkward
for me trying to write a debug/admin tool.

The attached patch makes it deterministic, by mapping them all to;

http://ftp.heanet.ie:80/?; 

Instead of ftp.heanet.ie/?. For for a cached webserver, this really
won't make much of a difference since the Host-header is forcably
lower-cased anyway, but for a proxy it definitely helps.  Looking
through my logs I'm seeing lots of simple domain case variations - no
point storing them twice and handling all of the expires multiple times.

It also solves the colision that happens if aan administrator wants to
run Apache listening on multiple ports, but has mod_cache enabled. 

The only awkwardness I can see with this approach, is that;

GET / HTTP/1.0

would look like this;

http://:80/?;

So, I've re-used the _default_ convention (underscores are not
permitted in DNS anyway) for such keys;

http://_default_:80/?;

Which should at least make a familiar sort of sence to an administrator.

-- 
Colm MacCárthaighPublic Key: [EMAIL PROTECTED]
Index: modules/cache/cache_storage.c
===
--- modules/cache/cache_storage.c   (revision 232304)
+++ modules/cache/cache_storage.c   (working copy)
@@ -318,12 +318,46 @@
 apr_status_t cache_generate_key_default(request_rec *r, apr_pool_t* p,
 char**key)
 {
-if (r-hostname) {
-*key = apr_pstrcat(p, r-hostname, r-uri, ?, r-args, NULL);
+const char *hostname;
+char *port_str, *scheme;
+int i;
+
+/* Use _default_ as the hostname if none present, as in mod_vhost
+ * Note: r-hostname is always lowercase
+ */
+hostname = r-hostname ? r-hostname : _default_;
+  
+/* Copy the scheme, ensuring that it is lower case. If the parsed uri
+ * contains no string, we use http as the default. This is a fair
+ * assumption, as request_rec is HTTP-specific.
+ */
+if (r-parsed_uri.scheme) {
+scheme = apr_pcalloc(p, strlen(r-parsed_uri.scheme) + 1);
+for (i = 0; r-parsed_uri.scheme[i]; i++) {
+scheme[i] = apr_tolower(r-parsed_uri.scheme[i]);
+}
 }
 else {
-*key = apr_pstrcat(p, r-uri, ?, r-args, NULL);
+scheme = http;
 }
+
+/* Copy the port string, ensuring that it is lower case (it may be a
+ * service name. If not present, use the connection to determine port
+ * number
+ */
+if (r-parsed_uri.port_str) {
+port_str = apr_pcalloc(p, strlen(r-parsed_uri.port_str) + 1);
+for (i = 0; r-parsed_uri.port_str[i]; i++) {
+port_str[i] = apr_tolower(r-parsed_uri.port_str[i]);
+}
+}
+else {
+port_str = apr_psprintf(p, %u, ap_get_server_port(r));
+}
+
+/* Key format is a URI */
+*key = apr_pstrcat(p, scheme, ://, hostname, :, port_str,
+   r-parsed_uri.path, ?, r-args, NULL);
+
 return APR_SUCCESS;
 }
-


Re: [PATCH] Make caching hash more deterministic

2005-08-12 Thread Brian Akins
Should this honor usecanonicalname?  If so, could just use 
ap_get_servername(r) in stead of r-hostname.  This may further compact 
the number of entries.




--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies


Re: [PATCH] Make caching hash more deterministic

2005-08-12 Thread Colm MacCarthaigh
On Fri, Aug 12, 2005 at 11:54:44AM -0400, Brian Akins wrote:
 Should this honor usecanonicalname?  If so, could just use 
 ap_get_servername(r) in stead of r-hostname.  This may further compact 
 the number of entries.

Yes, but I think there'd have to be additional code to detect the proxy
cases. And you pointing that out has just reminded me of a bug in my
patch - it doesn't work for;

GET ftp://ftp.heanet.ie/pub/heanet/100.txt HTTP/1.0

I'll go make that work too.

-- 
Colm MacCárthaighPublic Key: [EMAIL PROTECTED]


Re: [PATCH] Make caching hash more deterministic

2005-08-12 Thread Colm MacCarthaigh
On Fri, Aug 12, 2005 at 04:59:20PM +0100, Colm MacCarthaigh wrote:
 On Fri, Aug 12, 2005 at 11:54:44AM -0400, Brian Akins wrote:
  Should this honor usecanonicalname?  If so, could just use 
  ap_get_servername(r) in stead of r-hostname.  This may further compact 
  the number of entries.
 
 Yes, but I think there'd have to be additional code to detect the proxy
 cases. And you pointing that out has just reminded me of a bug in my
 patch - it doesn't work for;
 
   GET ftp://ftp.heanet.ie/pub/heanet/100.txt HTTP/1.0
 
 I'll go make that work too.

Here's a more involved patch that gets the logic right. It's 6pm on a
Friday for me, so I have only tested it a little, but thought I'd share
for comment before the weekend.

-- 
Colm MacCárthaighPublic Key: [EMAIL PROTECTED]
Index: modules/cache/cache_storage.c
===
--- modules/cache/cache_storage.c   (revision 232304)
+++ modules/cache/cache_storage.c   (working copy)
@@ -318,12 +318,90 @@
 apr_status_t cache_generate_key_default(request_rec *r, apr_pool_t* p,
 char**key)
 {
-if (r-hostname) {
-*key = apr_pstrcat(p, r-hostname, r-uri, ?, r-args, NULL);
+const char *hostname;
+char *port_str, *scheme, *hn;
+int i;
+
+/* Use the canonical name to improve cache hit rate, but only if this is
+ * not a proxy request. 
+ */ 
+if (!r-proxyreq) {
+/* Use _default_ as the hostname if none present, as in mod_vhost
+ */
+hostname = ap_get_server_name(r);
+if (!hostname) {
+hostname = _default_;
+}
 }
+else if(r-parsed_uri.hostname) {
+/* Copy the parsed uri hostname */
+hn = apr_pcalloc(p, strlen(r-parsed_uri.hostname) + 1);
+for (i = 0; r-parsed_uri.hostname[i]; i++) {
+hn[i] = apr_tolower(r-parsed_uri.hostname[i]);
+}
+
+/* const work-around */
+hostname = hn;
+}
 else {
-*key = apr_pstrcat(p, r-uri, ?, r-args, NULL);
+/* We are a proxied request, with no hostname. Unlikely
+ * to get very far - but just in case */
+hostname = _default_;
 }
+  
+/* Copy the scheme, ensuring that it is lower case. If the parsed uri
+ * contains no string or if this is not a proxy request, we use local as
+ * the default. 
+ *
+ * Why local? Ans: to indicate that the content is locally generated, and
+ * because  Apache can serve multiple protocols, lets not get tied to a
+ * single one. This way a mod_[ftp|bittorrent|foobar] front-end can share
+ * our content cache.
+ */
+if (r-proxyreq  r-parsed_uri.scheme) {
+/* Copy the scheme */
+scheme = apr_pcalloc(p, strlen(r-parsed_uri.scheme) + 1);
+for (i = 0; r-parsed_uri.scheme[i]; i++) {
+scheme[i] = apr_tolower(r-parsed_uri.scheme[i]);
+}
+}
+else {
+scheme = local;
+}
+
+/* If the content is locally generated, use the port-number of the
+ * current server. Otherwise. copy the URI's port-string (which may be a
+ * service name). If the URI contains no port-string, use apr-util's
+ * notion of the default port for that scheme - if available.
+ */
+if(r-proxyreq) {
+if (r-parsed_uri.port_str) {
+port_str = apr_pcalloc(p, strlen(r-parsed_uri.port_str) + 2);
+port_str[0] = ':';
+for (i = 0; r-parsed_uri.port_str[i]; i++) {
+port_str[i + 1] = apr_tolower(r-parsed_uri.port_str[i]);
+}
+}
+else if (apr_uri_port_of_scheme(scheme)) {
+port_str = apr_psprintf(p, :%u, apr_uri_port_of_scheme(scheme));
+}
+else {
+/* No port string given in the AbsoluteUri, and we have no
+ * idea what the default port for the scheme is. Leave it
+ * blank and live with the inefficiency of some extra cached
+ * entities.
+ */
+port_str = ;
+}   
+}   
+else {
+/* Use the server port */
+port_str = apr_psprintf(p, :%u, ap_get_server_port(r));
+}
+
+/* Key format is a URI */
+*key = apr_pstrcat(p, scheme, ://, hostname, port_str,
+   r-parsed_uri.path, ?, r-args, NULL);
+
 return APR_SUCCESS;
 }
-


Re: [PATCH] Make caching hash more deterministic

2005-08-12 Thread Jim Jagielski


On Aug 12, 2005, at 1:10 PM, Colm MacCarthaigh wrote:


On Fri, Aug 12, 2005 at 04:59:20PM +0100, Colm MacCarthaigh wrote:


On Fri, Aug 12, 2005 at 11:54:44AM -0400, Brian Akins wrote:


Should this honor usecanonicalname?  If so, could just use
ap_get_servername(r) in stead of r-hostname.  This may further  
compact

the number of entries.



Yes, but I think there'd have to be additional code to detect the  
proxy

cases. And you pointing that out has just reminded me of a bug in my
patch - it doesn't work for;

GET ftp://ftp.heanet.ie/pub/heanet/100.txt HTTP/1.0

I'll go make that work too.



Here's a more involved patch that gets the logic right. It's 6pm on a
Friday for me, so I have only tested it a little, but thought I'd  
share

for comment before the weekend.



+1 on inspection... testing to be done over
the weekend :)