Jeff Turner wrote (on forrest-dev):
On Wed, Aug 27, 2003 at 10:42:36AM +0100, Upayavira wrote:
Okay.
Jeff Turner wrote:
On Tue, Aug 26, 2003 at 06:27:08PM +1000, David Crossley wrote:
I rebuilt my local Forrest doco today but i get all these strange error messages about "site:" and "ext:" URLs being broken. Here is one example...
------
...
* [0] your-project.pdf
X [0] site:contrib BROKEN: No pipeline matched request: site:contrib
* [48] cap.html
...
------
On the other hand, i have a project site that builds with no such problems. So i do not know what is going on. Any clues?
The problem is with the new CLI: we have no way to exclude certain URLs from being traversed. The Forrest site gives these broken links because sitemap-ref.xml deliberately references some raw XML (index.xml), which contains refs to untranslated links like 'site:contrib'. It's just an annoyance really -- doesn't harm the actual output.
If no brilliant ideas are forthcoming, I'll hack <exclude-uri> support onto the Cocoon CLI so we can do a long-overdue 0.5 release.
Jeff,
Are you saying that the CLI is holding back a Forrest release?
A bit ;) 0.4 and previous versions have all had a mechanism to exclude certain URIs from being traversed. Forrest's own site gives errors if some URLs aren't excluded.
Is there a timescale for it?
No particular timescale. It's been 6 months since 0.4 though, so a release soon would be nice.
A few points:
1) If you switch back to link view, would that enable you to achieve your 'excludes' requirement?
Yes, but I've gotten used to the CLI speeding along, and wouldn't like to go back.
2) The LinkGatherer doesn't currently work, as a recent fix to caching broke it. It assumes that the LinkGatherer component isn't cached, as its 'gathering' side effect isn't cached.
Strange thing is, I haven't been able to replicate this in Forrest, after updating locally to CVS Cocoon. CLI rendering works fine, both on initial and subsequent renderings. I thought perhaps we have the buggy cache impl, but in my tests I'm using the same excalibur-store as in Cocoon, so I don't know what's going on.
Interesting. I think I know. Whilst hacking around, I added a getValidity() method to the LinkGatherer, thinking that that was what was breaking the cache. But I didn't commit it. I have been working from a non-working, caching LinkGatherer, whilst you're working with a working, non-caching LinkGatherer from CVS. So this is good news.
What it means is that link gathering works, but that, if you use link gathering, you can't take advantage of the new ability to write to files only if a page has changed. To get that working, I've got to get the links gathered by the LinkGatherer into the cache somehow.
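The caching problem described above can be illustrated with a small sketch. It is not Cocoon code: `ToyPipeline`, `CachedPage` and `render` are hypothetical names, and a plain `HashMap` stands in for the real cache. It only shows that a cache hit skips the link-gathering side effect unless the gathered links are stored alongside the cached content.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model, not the Cocoon LinkGatherer or its cache.
final class ToyPipeline {
    /** Cached entry: page content plus the links found while generating it. */
    private static final class CachedPage {
        final String content;
        final List<String> links;

        CachedPage(String content, List<String> links) {
            this.content = content;
            this.links = links;
        }
    }

    private final Map<String, CachedPage> cache = new HashMap<>();
    private final boolean storeLinksInCache;

    public ToyPipeline(boolean storeLinksInCache) {
        this.storeLinksInCache = storeLinksInCache;
    }

    /** Renders a page and appends any links found to the gathered list. */
    public String render(String uri, List<String> gathered) {
        CachedPage hit = cache.get(uri);
        if (hit != null) {
            // Cache hit: generation is skipped, so links are only reported
            // if they were stored together with the cached content.
            gathered.addAll(hit.links);
            return hit.content;
        }
        // Cache miss: "generate" the page and gather its links as a side effect.
        String content = "<html>" + uri + "</html>";
        List<String> found = Arrays.asList(uri + "/a.html", uri + "/b.html");
        gathered.addAll(found);
        List<String> stored = storeLinksInCache ? found : Collections.<String>emptyList();
        cache.put(uri, new CachedPage(content, stored));
        return content;
    }

    public static void main(String[] args) {
        for (boolean storeLinks : new boolean[] {false, true}) {
            ToyPipeline pipeline = new ToyPipeline(storeLinks);
            pipeline.render("docs", new ArrayList<String>()); // first run fills the cache
            List<String> secondRun = new ArrayList<>();
            pipeline.render("docs", secondRun);                // second run is a cache hit
            System.out.println("links stored in cache: " + storeLinks
                    + ", links gathered on second run: " + secondRun.size());
        }
    }
}
```

With `storeLinksInCache` set to `false`, the second rendering gathers no links, so a crawler relying on them would stop following the site; with `true`, the links survive the cache hit.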
3) I think I might be able to fix that (just rebuilding my Eclipse environment...), by setting the LinkGatherer to return null in response to getValidity().
4) I just started thinking about your excludes code (assuming that link gathering does start working again). Basically, there are a number of things one can exclude upon: source URI, source prefix, full source URI (prefix and URI), final destination URI. How about something like:
<exclude type="regexp | wildcard" src="source-uri | source-prefix | full-source-uri | dest-uri" match="<pattern>"/>
<include type="regexp | wildcard" src="source-uri | source-prefix | full-source-uri | dest-uri" match="<pattern>"/>
I'd be happy with a simple 'ignore this link', but wildcards would be great. I'm a bit confused by all the @src types though. Is 'dest-uri' the final filesystem destination? Is there anything possible with src="dest-uri" that isn't possible otherwise? Does 'src-prefix' mean "ignore URIs starting with this prefix"? If so, why not just use a wildcard?
The thing is, you might want to exclude a certain URL from going to one destination but not another, so you'd need to specify a wildcard on either source or destination. However, given that a wildcard can be used to deal with prefixes, we don't need to specifically worry about prefixes. So, I propose:
<exclude-source match="<wildcard pattern>"/>
<exclude-destination match="<wildcard pattern>"/>
<exclude-source match="<wildcard pattern>"/>
<exclude-destination match="<wildcard pattern>"/>
I don't want to use <exclude type="source" ...> as I want to reserve the type attribute for specifying whether to use a wildcard or regexp matcher.
Thoughts?
I've got some basic code in place to do includes/excludes - I'll keep you posted.
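A minimal sketch of how source and destination wildcard excludes like those proposed above might be evaluated. Nothing here is taken from the Cocoon CLI: `LinkFilter`, `excludeSource`, `excludeDestination` and `accepts` are hypothetical names, and `*` is assumed to match any run of characters, including `/`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Cocoon CLI implementation.
final class LinkFilter {
    private final List<Pattern> sourceExcludes = new ArrayList<>();
    private final List<Pattern> destinationExcludes = new ArrayList<>();

    /** Converts a wildcard to a regex: '*' matches any run of characters, the rest is literal. */
    static Pattern wildcard(String pattern) {
        StringBuilder regex = new StringBuilder();
        String[] parts = pattern.split("\\*", -1); // -1 keeps trailing empty parts
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString());
    }

    public LinkFilter excludeSource(String pattern) {
        sourceExcludes.add(wildcard(pattern));
        return this;
    }

    public LinkFilter excludeDestination(String pattern) {
        destinationExcludes.add(wildcard(pattern));
        return this;
    }

    /** Returns true if neither the source URI nor the destination URI is excluded. */
    public boolean accepts(String sourceUri, String destinationUri) {
        return !matchesAny(sourceExcludes, sourceUri)
                && !matchesAny(destinationExcludes, destinationUri);
    }

    private static boolean matchesAny(List<Pattern> patterns, String value) {
        for (Pattern p : patterns) {
            if (p.matcher(value).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        LinkFilter filter = new LinkFilter()
                .excludeSource("site:*")
                .excludeSource("ext:*");
        System.out.println(filter.accepts("site:contrib", "build/site/contrib"));
        System.out.println(filter.accepts("cap.html", "build/site/cap.html"));
    }
}
```

Because a trailing `*` already covers the prefix case (`site:*`), a separate prefix option is not needed in this sketch; a regexp matcher could be added later behind a `type` switch.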
With include, you can have only a very narrow part of your site crawled.
Note: I think the xconf format needs some serious rethinking, so this would be a temporary extension.
I agree, the format isn't something that can be decided up-front. I wouldn't worry too much about keeping backwards-compat.
What do you think?
I'm struggling to fit a number of projects into limited time (1 1/2 hours per day - want to do Cocoon stuff, but need to work on some other sites), but I'm keen to get Cocoon working for you.
Thanks very much :) I'm in the same boat, working on Forrest in the evenings. No rush -- there's plenty of other stuff to keep us busy before a release.
I've just managed to shove one burning project two weeks into the future, so I'm back on for Cocoon for a while!
PS: in your CLI experiments, have you ever encountered a bug where the last link in a page isn't crawled? I'll try to come up with a decent replicable example, but thought I'd mention it anyway.
To be honest, I haven't. Give me an example, and I'll look into it.
Regards, Upayavira