Nicola Ken wrote:
> So, from quick glance, it seems that the way it's done is IMHO the
> right way.

Glad you think so!

> Upayavira wrote in bugzilla:
> "
> This code appears to try to check pages that begin with #, javascript:
> or http://. I plan to prevent this, and probably sort other things
> too, but I'd like to see what people think of this code before I do
> anything else. "

> Could you please explain it a bit more, and the changes you'd like to
> make. Especially, is this behaviour different from the previous one?

I ran this on a site I built some time ago (with nasty things like javascript: links) and
got files generated for:

#US
#nonUS
http_
javascript_form.submit()
[EMAIL PROTECTED]
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http_\www.magamall.com=client
javascript_form.submit()

None of which should have been generated. I would therefore ignore links that begin 
with #, javascript: or mailto:. The controversial one, I presume, is ignoring links that 
begin with http://. We could get around this by adding a configuration parameter that 
specifies the name of the server that the site is based upon. So, when generating the 
Cocoon site, we could specify that URIs that begin with http://xml.apache.org/cocoon 
should be spidered, but references to (for example) http://www.w3.org should be 
ignored. By default, I'd just ignore any links that begin with http://.
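
To make the rule concrete, here is a rough sketch of the check I have in mind. It is written in Python purely for illustration and is not the actual Cocoon code; the function name should_follow and the base_url parameter are made-up names standing in for the proposed configuration parameter.

```python
# Illustrative only: should_follow and base_url are hypothetical names,
# not part of Cocoon or its xconf configuration.
IGNORED_PREFIXES = ("#", "javascript:", "mailto:")


def should_follow(link, base_url=None):
    """Return True if the crawler should generate a page for this link."""
    lowered = link.strip().lower()
    # Fragment, script and mail links never correspond to a page.
    if not lowered or lowered.startswith(IGNORED_PREFIXES):
        return False
    if lowered.startswith("http://"):
        # Absolute links are skipped by default; they are followed only
        # when they begin with the configured site base.
        return base_url is not None and lowered.startswith(base_url.lower())
    # Anything else is treated as a relative link within the site.
    return True
```

With base_url left unset, every http:// link is ignored, which matches the default proposed above. Setting it to http://xml.apache.org/cocoon would let links under that prefix through while still skipping http://www.w3.org.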

I haven't tried this using the old behaviour. I will, and will let you know.

> Also, have you yet measured the speed increases?

I haven't measured it. I'll do that and report back. I'll add some code to report the
time taken to generate the site (much like the build script).
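
The timing itself would be something as simple as the following. Again this is only a Python sketch of the idea, not Cocoon code; timed is a hypothetical helper name.

```python
import time


def timed(task):
    """Run task() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = task()
    return result, time.perf_counter() - start
```

Wrapping the site generation call in timed would give the elapsed seconds to print at the end of the run, so the old and new behaviours can be compared on the same site.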

> Is it possible to also have the same 3 step behaviour there was
> before?

Yes. I've left the original behaviour as the default. All other behaviours can be 
configured in the xconf file.

Regards, Upayavira
