Re: Cocoon CLI: excluding URIs

Upayavira Wed, 27 Aug 2003 19:32:00 +0000

Switching from Forrest-dev...

Jeff Turner wrote (on forrest-dev):

On Wed, Aug 27, 2003 at 10:42:36AM +0100, Upayavira wrote:
Jeff Turner wrote:
On Tue, Aug 26, 2003 at 06:27:08PM +1000, David Crossley wrote:
I rebuilt my local Forrest doco today but i get all these strange
error messages about "site:" and "ext:" URLs being broken.
Here is one example...
------
...
* [0] your-project.pdf
X [0] site:contrib    BROKEN: No pipeline matched request: site:contrib
* [48] cap.html
...
------
On the other hand, i have a project site that builds with no such problems. So i do not know what is going on. Any clues?
The problem is with the new CLI: we have no way to exclude certain URLs
from being traversed. The Forrest site gives these broken links because
sitemap-ref.xml deliberately references some raw XML (index.xml), which
contains refs to untranslated links like 'site:contrib'.  It's just an
annoyance really -- doesn't harm the actual output.
If no brilliant ideas are forthcoming, I'll hack <exclude-uri> support
onto the Cocoon CLI so we can do a long-overdue 0.5 release.
Jeff,

Are you saying that the CLI is holding back a Forrest release?
A bit ;)  0.4 and previous versions have all had a mechanism to exclude
certain URIs from being traversed.  Forrest's own site gives errors if
some URLs aren't excluded.
Is there a timescale for it?

No particular timescale. It's been 6 months since 0.4 though, so a release soon would be nice.

A few points:

1) If you switch back to link view, would that enable you to achieve your 'excludes' requirement?

Yes, but I've gotten used to the CLI speeding along, and wouldn't like to go back.

Okay.

2) The LinkGatherer doesn't currently work, as a recent fix to caching broke it. It assumes that the LinkGatherer component isn't cached, as its 'gathering' side effect isn't cached.
Strange thing is, I haven't been able to replicate this in Forrest, after
updating locally to CVS Cocoon.  CLI rendering works fine, both on
initial and subsequent renderings.  I thought perhaps we have the buggy
cache impl, but in my tests I'm using the same excalibur-store as in
Cocoon, so I don't know what's going on.

Interesting. I think I know. Whilst hacking around, I added a getValidity() method to the LinkGatherer, thinking that its absence was what was breaking the cache. But I didn't commit it. So I have been working with a broken, caching LinkGatherer, whilst you're working with the working, non-caching LinkGatherer from CVS. So this is good news.

What it means is that link gathering works, but that, if you use link gathering, you can't take advantage of the new ability to write to files only if a page has changed. To get that working, I've got to get the links gathered by the LinkGatherer into the cache somehow.
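
To illustrate the idea of getting the gathered links into the cache, here is a minimal sketch. It is not Cocoon code: CachedPage, LinkCachingSketch, generate() and linksFor() are hypothetical names, and the "cache" is just a map. It only shows the principle that if the links are stored next to the cached content, a cache hit can replay them without regenerating the page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical cache entry: the page content plus the links gathered from it. */
final class CachedPage {
    final String content;
    final List<String> links;

    CachedPage(String content, List<String> links) {
        this.content = content;
        this.links = Collections.unmodifiableList(new ArrayList<>(links));
    }
}

/** Hypothetical sketch, not the Cocoon implementation. */
final class LinkCachingSketch {
    private final Map<String, CachedPage> cache = new HashMap<>();

    /** Counts how often a page was really generated (for demonstration only). */
    int generations = 0;

    /** Stand-in for running a pipeline: produces content and gathers links as a side effect. */
    private CachedPage generate(String uri) {
        generations++;
        List<String> gathered = new ArrayList<>();
        gathered.add("linked-from-" + uri);
        return new CachedPage("<html>" + uri + "</html>", gathered);
    }

    /** Returns the links for a page; on a cache hit they come from the cache entry. */
    List<String> linksFor(String uri) {
        CachedPage page = cache.computeIfAbsent(uri, this::generate);
        return page.links;
    }
}
```

Calling linksFor() twice for the same URI generates the page once and returns the same links both times, which is the behaviour the crawler would need in order to skip unchanged pages and still follow their links.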

3) I think I might be able to fix that (just rebuilding my Eclipse environment...), by setting the LinkGatherer to return null in response to getValidity().

4) I just started thinking about your excludes code (assuming that link gathering does start working again). Basically, there are a number of things one can exclude upon: source URI, source prefix, full source URI (prefix and URI), or final destination URI. How about something like:

<exclude type="regexp | wildcard" src="source-uri | source-prefix | full-source-uri | dest-uri" match="<pattern>"/>
<include type="regexp | wildcard" src="source-uri | source-prefix | full-source-uri | dest-uri" match="<pattern>"/>
I'd be happy with a simple 'ignore this link', but wildcards would be
great.
I'm a bit confused by all the @src types though. Is 'dest-uri' the final filesystem destination? Is there anything possible with src="dest-uri" that isn't possible otherwise? Does 'src-prefix' mean "ignore URIs starting with this prefix"? If so, why not just use a wildcard?

The thing is, you might want to exclude a certain URL from going to one destination but not another, so you'd need to specify a wildcard on either source or destination. However, given that a wildcard can be used to deal with prefixes, we don't need to specifically worry about prefixes. So, I propose:

<exclude-source match="<wildcard pattern>"/>
<exclude-destination match="<wildcard pattern>"/>
<include-source match="<wildcard pattern>"/>
<include-destination match="<wildcard pattern>"/>

I don't want to use <exclude type="source" ...> as I want to reserve the type attribute for specifying whether to use a wildcard or regexp matcher.
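
As a rough illustration of how wildcard include/exclude matching could behave, here is a minimal sketch. It is not the CLI code: UriFilterSketch and its methods are hypothetical names, '*' is assumed to match any run of characters (which may differ from Cocoon's own wildcard matcher), and "exclude wins over include" is an assumed rule. One such filter could be applied to source URIs and another to destination URIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of wildcard include/exclude filtering for crawled URIs. */
final class UriFilterSketch {
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    /** Assumption: '*' matches any run of characters; all other characters are literal. */
    static Pattern wildcardToRegex(String wildcard) {
        String[] parts = wildcard.split("\\*", -1); // -1 keeps trailing empty parts
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString());
    }

    UriFilterSketch include(String wildcard) {
        includes.add(wildcardToRegex(wildcard));
        return this;
    }

    UriFilterSketch exclude(String wildcard) {
        excludes.add(wildcardToRegex(wildcard));
        return this;
    }

    /**
     * Assumed rule: a URI is rejected if any exclude matches; otherwise it is
     * accepted if there are no includes, or if at least one include matches.
     */
    boolean accepts(String uri) {
        for (Pattern p : excludes) {
            if (p.matcher(uri).matches()) {
                return false;
            }
        }
        if (includes.isEmpty()) {
            return true;
        }
        for (Pattern p : includes) {
            if (p.matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With new UriFilterSketch().exclude("site:*").exclude("ext:*"), a link such as site:contrib would be skipped while cap.html would still be crawled. Adding an include pattern narrows the crawl to URIs matching it.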

Thoughts?

I've got some basic code in place to do includes/excludes - I'll keep you posted.

With include, you can have only a very narrow part of your site
crawled.
Note: I think the xconf format needs some serious rethinking, so this would be a temporary extension.
I agree, the format isn't something that can be decided up-front. I wouldn't worry too much about keeping backwards-compat.

What do you think?

I'm struggling to fit a number of projects into limited time (1 1/2 hours per day; I want to do Cocoon stuff, but need to work on some other sites), but I'm keen to get Cocoon working for you.
Thanks very much :)  I'm in the same boat, working on Forrest in the
evenings.  No rush -- there's plenty of other stuff to keep us busy
before a release.

I've just managed to shove one burning project two weeks into the future, so I'm back on for Cocoon for a while!

PS: in your CLI experiments, have you ever encountered a bug where the
last link in a page isn't crawled?  I'll try to come up with a decent
replicable example, but thought I'd mention it anyway.

To be honest, I haven't. Give me an example, and I'll look into it.

Regards, Upayavira
