Re: [PATCH] Make caching hash more deterministic

2005-08-15 Thread Colm MacCarthaigh
On Sat, Aug 13, 2005 at 10:29:54AM +0200, Graham Leggett wrote:
 The idea of canonicalising the name is sound, but munging them into an 
 added :80 and an added ? is really ugly - these are not the kind of URLs 
 that an end user would understand at a glance if they had to see them 
 listed.

An end-user should never see these keys, the only place they are visible
to any user is the semi-binary mod_disk_cache files. An administrator
would have to really know what they're doing to find them, or be using
htcacheadmin - once I finish that, and if it gets accepted. 

 Is it possible to remove the :80 if the scheme is https, and remove the 
 :443 if the scheme is https:? What is the significance of the added ??

The ? isn't me, that's current mod_cache behaviour, so I left it
alone.

It doesn't have any significance except for avoiding an extra condition.
r-args is part of the key aswell, it just happens to have been NULL in
those examples.

Either way, doing as you suggest is trivial, but is there really a point
adding more conditions? Any tool which does inspect the cache files can
clean it up for presentation to the administrator.

-- 
Colm MacCárthaighPublic Key: [EMAIL PROTECTED]


Re: [PATCH] Make caching hash more deterministic

2005-08-15 Thread Jim Jagielski


On Aug 15, 2005, at 4:10 AM, Colm MacCarthaigh wrote:


On Sat, Aug 13, 2005 at 10:29:54AM +0200, Graham Leggett wrote:

The idea of canonicalising the name is sound, but munging them  
into an
added :80 and an added ? is really ugly - these are not the kind  
of URLs

that an end user would understand at a glance if they had to see them
listed.



An end-user should never see these keys, the only place they are  
visible

to any user is the semi-binary mod_disk_cache files. An administrator
would have to really know what they're doing to find them, or be using
htcacheadmin - once I finish that, and if it gets accepted.


Is it possible to remove the :80 if the scheme is https, and  
remove the
:443 if the scheme is https:? What is the significance of the  
added ??




The ? isn't me, that's current mod_cache behaviour, so I left it
alone.

It doesn't have any significance except for avoiding an extra  
condition.
r-args is part of the key aswell, it just happens to have been  
NULL in

those examples.

Either way, doing as you suggest is trivial, but is there really a  
point
adding more conditions? Any tool which does inspect the cache files  
can

clean it up for presentation to the administrator.



I think Colm has a valid point... since these are internal  
representations

then the cleaning up would best be done by the actual view process.
I would imagine that keeping the internal representations consistent
would streamline the actual functional aspects.


Re: [PATCH] Make caching hash more deterministic

2005-08-15 Thread Graham Leggett

Colm MacCarthaigh wrote:


An end-user should never see these keys, the only place they are visible
to any user is the semi-binary mod_disk_cache files. An administrator
would have to really know what they're doing to find them, or be using
htcacheadmin - once I finish that, and if it gets accepted. 


Ok cool - I was not sure whether the URL list would be mined for data 
for any reason.



Either way, doing as you suggest is trivial, but is there really a point
adding more conditions? Any tool which does inspect the cache files can
clean it up for presentation to the administrator.


In that case you're right - the less work done on the URL, the faster it 
will be.


Regards,
Graham
--


Re: [PATCH] Make caching hash more deterministic

2005-08-14 Thread Graham Leggett

Colm MacCarthaigh wrote:


Currently;

GET / HTTP/1.1
Host: ftp.heanet.ie

GET http://ftp.heanet.ie/ HTTP/1.0

GET HTTP://Ftp.Heanet.Ie/ HTTP/1.0

are all mapped to different hashes by mod_cache; despite being the same
content, this is an inefficient waste of disk space and really awkward
for me trying to write a debug/admin tool.

The attached patch makes it deterministic, by mapping them all to;

	http://ftp.heanet.ie:80/?; 


The idea of canonicalising the name is sound, but munging them into an 
added :80 and an added ? is really ugly - these are not the kind of URLs 
that an end user would understand at a glance if they had to see them 
listed.


Is it possible to remove the :80 if the scheme is https, and remove the 
:443 if the scheme is https:? What is the significance of the added ??


Regards,
Graham
--




Re: [PATCH] Make caching hash more deterministic

2005-08-13 Thread Colm MacCarthaigh
On Fri, Aug 12, 2005 at 01:34:50PM -0400, Jim Jagielski wrote:
 Here's a more involved patch that gets the logic right. It's 6pm on a
 Friday for me, so I have only tested it a little, but thought I'd  
 share
 for comment before the weekend.
 
 
 +1 on inspection... testing to be done over
 the weekend :)

Of course :) I've run http local and proxy cases, and ftp proxy cases,
as well as a few odd things now. With UseCanonicalName on, it does
improve the hitrates. 

I've changed the patch a little (attached) but only some cosmetic
comment changes, and I ditched the local:// uri;

Thinking about it, for a cache to be shared amongst protocols, things
like the connection port would have to be faked anyway. So might aswell
include the real serving protocol - makes much more sense to
administrators.

-- 
Colm MacCárthaighPublic Key: [EMAIL PROTECTED]
Index: modules/cache/cache_storage.c
===
--- modules/cache/cache_storage.c   (revision 232437)
+++ modules/cache/cache_storage.c   (working copy)
@@ -320,12 +320,82 @@
 apr_status_t cache_generate_key_default(request_rec *r, apr_pool_t* p,
 char**key)
 {
-if (r-hostname) {
-*key = apr_pstrcat(p, r-hostname, r-uri, ?, r-args, NULL);
+char *port_str, *scheme, *hn;
+const char * hostname;
+int i;
+
+/* Use the canonical name to improve cache hit rate, but only if this is
+ * not a proxy request. 
+ */ 
+if (!r-proxyreq) {
+/* Use _default_ as the hostname if none present, as in mod_vhost */
+hostname =  ap_get_server_name(r);
+if (!hostname) {
+hostname = _default_;
+}
 }
+else if(r-parsed_uri.hostname) {
+/* Copy the parsed uri hostname */
+hn = apr_pcalloc(p, strlen(r-parsed_uri.hostname) + 1);
+for (i = 0; r-parsed_uri.hostname[i]; i++) {
+hn[i] = apr_tolower(r-parsed_uri.hostname[i]);
+}
+/* const work-around */
+hostname = hn;
+}
 else {
-*key = apr_pstrcat(p, r-uri, ?, r-args, NULL);
+/* We are a proxied request, with no hostname. Unlikely
+ * to get very far - but just in case */
+hostname = _default_;
 }
+  
+/* Copy the scheme, ensuring that it is lower case. If the parsed uri
+ * contains no string or if this is not a proxy request.
+ */
+if (r-proxyreq  r-parsed_uri.scheme) {
+/* Copy the scheme */
+scheme = apr_pcalloc(p, strlen(r-parsed_uri.scheme) + 1);
+for (i = 0; r-parsed_uri.scheme[i]; i++) {
+scheme[i] = apr_tolower(r-parsed_uri.scheme[i]);
+}
+}
+else {
+scheme = http;
+}
+
+/* If the content is locally generated, use the port-number of the
+ * current server. Otherwise. copy the URI's port-string (which may be a
+ * service name). If the URI contains no port-string, use apr-util's
+ * notion of the default port for that scheme - if available.
+ */
+if(r-proxyreq) {
+if (r-parsed_uri.port_str) {
+port_str = apr_pcalloc(p, strlen(r-parsed_uri.port_str) + 2);
+port_str[0] = ':';
+for (i = 0; r-parsed_uri.port_str[i]; i++) {
+port_str[i + 1] = apr_tolower(r-parsed_uri.port_str[i]);
+}
+}
+else if (apr_uri_port_of_scheme(scheme)) {
+port_str = apr_psprintf(p, :%u, apr_uri_port_of_scheme(scheme));
+}
+else {
+/* No port string given in the AbsoluteUri, and we have no
+ * idea what the default port for the scheme is. Leave it
+ * blank and live with the inefficiency of some extra cached
+ * entities.
+ */
+port_str = ;
+}   
+}   
+else {
+/* Use the server port */
+port_str = apr_psprintf(p, :%u, ap_get_server_port(r));
+}
+
+/* Key format is a URI */
+*key = apr_pstrcat(p, scheme, ://, hostname, port_str,
+   r-parsed_uri.path, ?, r-args, NULL);
+
 return APR_SUCCESS;
 }
-


Re: [PATCH] Make caching hash more deterministic

2005-08-12 Thread Brian Akins
Should this honor usecanonicalname?  If so, could just use 
ap_get_servername(r) in stead of r-hostname.  This may further compact 
the number of entries.




--
Brian Akins
Lead Systems Engineer
CNN Internet Technologies


Re: [PATCH] Make caching hash more deterministic

2005-08-12 Thread Colm MacCarthaigh
On Fri, Aug 12, 2005 at 11:54:44AM -0400, Brian Akins wrote:
 Should this honor usecanonicalname?  If so, could just use 
 ap_get_servername(r) in stead of r-hostname.  This may further compact 
 the number of entries.

Yes, but I think there'd have to be additional code to detect the proxy
cases. And you pointing that out has just reminded me of a bug in my
patch - it doesn't work for;

GET ftp://ftp.heanet.ie/pub/heanet/100.txt HTTP/1.0

I'll go make that work too.

-- 
Colm MacCárthaighPublic Key: [EMAIL PROTECTED]


Re: [PATCH] Make caching hash more deterministic

2005-08-12 Thread Colm MacCarthaigh
On Fri, Aug 12, 2005 at 04:59:20PM +0100, Colm MacCarthaigh wrote:
 On Fri, Aug 12, 2005 at 11:54:44AM -0400, Brian Akins wrote:
  Should this honor usecanonicalname?  If so, could just use 
  ap_get_servername(r) in stead of r-hostname.  This may further compact 
  the number of entries.
 
 Yes, but I think there'd have to be additional code to detect the proxy
 cases. And you pointing that out has just reminded me of a bug in my
 patch - it doesn't work for;
 
   GET ftp://ftp.heanet.ie/pub/heanet/100.txt HTTP/1.0
 
 I'll go make that work too.

Here's a more involved patch that gets the logic right. It's 6pm on a
Friday for me, so I have only tested it a little, but thought I'd share
for comment before the weekend.

-- 
Colm MacCárthaighPublic Key: [EMAIL PROTECTED]
Index: modules/cache/cache_storage.c
===
--- modules/cache/cache_storage.c   (revision 232304)
+++ modules/cache/cache_storage.c   (working copy)
@@ -318,12 +318,90 @@
 apr_status_t cache_generate_key_default(request_rec *r, apr_pool_t* p,
 char**key)
 {
-if (r-hostname) {
-*key = apr_pstrcat(p, r-hostname, r-uri, ?, r-args, NULL);
+const char *hostname;
+char *port_str, *scheme, *hn;
+int i;
+
+/* Use the canonical name to improve cache hit rate, but only if this is
+ * not a proxy request. 
+ */ 
+if (!r-proxyreq) {
+/* Use _default_ as the hostname if none present, as in mod_vhost
+ */
+hostname = ap_get_server_name(r);
+if (!hostname) {
+hostname = _default_;
+}
 }
+else if(r-parsed_uri.hostname) {
+/* Copy the parsed uri hostname */
+hn = apr_pcalloc(p, strlen(r-parsed_uri.hostname) + 1);
+for (i = 0; r-parsed_uri.hostname[i]; i++) {
+hn[i] = apr_tolower(r-parsed_uri.hostname[i]);
+}
+
+/* const work-around */
+hostname = hn;
+}
 else {
-*key = apr_pstrcat(p, r-uri, ?, r-args, NULL);
+/* We are a proxied request, with no hostname. Unlikely
+ * to get very far - but just in case */
+hostname = _default_;
 }
+  
+/* Copy the scheme, ensuring that it is lower case. If the parsed uri
+ * contains no string or if this is not a proxy request, we use local as
+ * the default. 
+ *
+ * Why local? Ans: to indicate that the content is locally generated, and
+ * because  Apache can serve multiple protocols, lets not get tied to a
+ * single one. This way a mod_[ftp|bittorrent|foobar] front-end can share
+ * our content cache.
+ */
+if (r-proxyreq  r-parsed_uri.scheme) {
+/* Copy the scheme */
+scheme = apr_pcalloc(p, strlen(r-parsed_uri.scheme) + 1);
+for (i = 0; r-parsed_uri.scheme[i]; i++) {
+scheme[i] = apr_tolower(r-parsed_uri.scheme[i]);
+}
+}
+else {
+scheme = local;
+}
+
+/* If the content is locally generated, use the port-number of the
+ * current server. Otherwise. copy the URI's port-string (which may be a
+ * service name). If the URI contains no port-string, use apr-util's
+ * notion of the default port for that scheme - if available.
+ */
+if(r-proxyreq) {
+if (r-parsed_uri.port_str) {
+port_str = apr_pcalloc(p, strlen(r-parsed_uri.port_str) + 2);
+port_str[0] = ':';
+for (i = 0; r-parsed_uri.port_str[i]; i++) {
+port_str[i + 1] = apr_tolower(r-parsed_uri.port_str[i]);
+}
+}
+else if (apr_uri_port_of_scheme(scheme)) {
+port_str = apr_psprintf(p, :%u, apr_uri_port_of_scheme(scheme));
+}
+else {
+/* No port string given in the AbsoluteUri, and we have no
+ * idea what the default port for the scheme is. Leave it
+ * blank and live with the inefficiency of some extra cached
+ * entities.
+ */
+port_str = ;
+}   
+}   
+else {
+/* Use the server port */
+port_str = apr_psprintf(p, :%u, ap_get_server_port(r));
+}
+
+/* Key format is a URI */
+*key = apr_pstrcat(p, scheme, ://, hostname, port_str,
+   r-parsed_uri.path, ?, r-args, NULL);
+
 return APR_SUCCESS;
 }
-


Re: [PATCH] Make caching hash more deterministic

2005-08-12 Thread Jim Jagielski


On Aug 12, 2005, at 1:10 PM, Colm MacCarthaigh wrote:


On Fri, Aug 12, 2005 at 04:59:20PM +0100, Colm MacCarthaigh wrote:


On Fri, Aug 12, 2005 at 11:54:44AM -0400, Brian Akins wrote:


Should this honor usecanonicalname?  If so, could just use
ap_get_servername(r) in stead of r-hostname.  This may further  
compact

the number of entries.



Yes, but I think there'd have to be additional code to detect the  
proxy

cases. And you pointing that out has just reminded me of a bug in my
patch - it doesn't work for;

GET ftp://ftp.heanet.ie/pub/heanet/100.txt HTTP/1.0

I'll go make that work too.



Here's a more involved patch that gets the logic right. It's 6pm on a
Friday for me, so I have only tested it a little, but thought I'd  
share

for comment before the weekend.



+1 on inspection... testing to be done over
the weekend :)