On Sat, 2 Aug 2003 22:08:21 +1000, "Jeff Turner" <[EMAIL PROTECTED]> said:
> Hi,
>
> I'm tinkering around with the CLI, thinking how to add
> don't-crawl-this-page support, and have some questions on how cli.xconf
> currently works. The following block in cli.xconf has me confused..

Jeff.

Great to see you're engaging with it!

I have also been working on the CLI. I've spent my week's spare time completely reworking it. I'll post separately about what I've been up to, but basically the whole thing should be much easier to understand, with a separate crawler class, a separate class for handling Cocoon initialisation, and another for handling URI arithmetic (which you're talking about below).

As to adding exclusions, I think it should merely be a question of identifying the syntax. The rest, with my new code, should be pretty easy (e.g. tell the crawler what to ignore with a set of wildcard parameters).

I haven't been able to debug this, as my copy of Eclipse insists on entering Java's Classloader code when I try to debug it. When I've worked out how to stop Eclipse doing that, I'll get it debugged, and put it into the scratchpad.

When I've got this going, I'm going to convert the xconf code to use a Configuration object, and then write an Ant task to do the same ProcessXConf, so that you can have the xconf code directly in your Ant script. This Ant task will be a simple wrapper around the bean, and should be pretty trivial.

I have also, I think, just sorted my problem with my caching code not working. Basically, the Cocoon cache is transient. So therefore it is lost every time Cocoon starts. And Cocoon is started every time the CLI starts. So if we want to have the CLI only generate new pages based upon the cache, we've got to make the cache for the CLI persistent. Again, see separate thread.

> | The old behaviour - appends uri to the specified destination
> | directory (as specified in <dest-dir>):
> |
> | <uri>documents/index.html</uri>
>
> Do we still want this <uri>...</uri> behaviour? Currently the CLI only
> accepts <uri src="..."/>.

I think someone (Joerg?) fixed a bug, that might have also disabled the old behaviour. I would be happy to let it go, but the benefit of it is where you have a lot of pages that share a destination.
Otherwise you'd have to repeat the destination URI for each page.

> Come to think of it, the attribute name 'src'
> doesn't really make sense. What is the "source" of a Cocoon URI? It
> would be the XML (documents/index.xml), which is not what we're
> specifying in @src.

It is the source for a source/destination pair. You could see it as a cocoon: protocol source (almost). Would you suggest something different?

> | Append: append the generated page's URI to the end of the
> | source URI:
> |
> | <uri type="append" src-prefix="documents/" src="index.html"
> | dest="build/dest/"/>
>
> What is a 'source URI' here, and why would we want to append another URI
> (URIs are not additive)? Does this mean documents/index.html would be
> written to build/dest/? If so, why separate @src-prefix and @src?

This is what I've started calling (after Bernard) URI Arithmetic. Different ways to calculate your destination page from your source page URIs. I have to say, I haven't yet found the best language for explaining this, so please do bear with me.

Let's take the example of Cocoon documentation. The Cocoon URI is documents/index.html. We want the URI of the file produced to be build/dest/index.html. So we don't want 'documents' in the destination URI. But we need it in the source URI. So we therefore use this as the src-prefix, i.e. it is included in the source URI, but excluded from the destination URI.

Now, why have 'append', 'replace', etc? Well, sometimes you will want to append the source URI to the destination URI - in our case appending 'index.html' to 'build/dest/' gives 'build/dest/index.html', which is what we want. But also, if we crawl on to news.html, adding that to 'build/dest/' will give us 'build/dest/news.html', which again is what we want.

However, a scenario I have is where no crawling is taking place, and there is no relationship between the source and destination URIs. So for example: /site/page1.html could be saved as /foobar/client1.html.
In that scenario one would use 'REPLACE' as the type.

> | Replace: Completely ignore the generated page's URI - just
> | use the destination URI:
> |
> | <uri type="replace" src-prefix="documents/" src="index.html"
> | dest="build/dest/docs.html"/>
>
> Sounds fine, but again, since we know the whole URI
> (documents/index.html), why separate into @src-prefix and @src?

In this scenario, the src-prefix isn't really needed, as the src is ignored when calculating the destination uri.

> | Insert: Insert generated page's URI into the destination
> | URI at the point marked with a * (example uses fictional
> | zip protocol)
> |
> | <uri type="insert" src-prefix="documents/" src="index.html"
> | dest="zip://*.zip/page.html"/>
>
> Leaves me very confused.. what would be the result here? An index.zip
> file, containing the bytes from documents/index.html saved as page.html?
> Is there a non-fictional scenario where this makes more sense? :)

Fraid there isn't a non-fictional one ATM. This one was put there simply for completeness (only took minutes to implement). To my mind, it is append and replace that are the most important features.

> Anyway, on to the subject of excluding certain URIs.. are there any
> preferred ways of doing it? I've currently got:
>
> <ignore-uri>....</ignore-uri>
>
> working, which seems crude but effective. Ideally I'd like to:
> - Use wildcards ("don't crawl '*.xml' URLs")
> - be able to exclude links based on which page they originate from
> ("ignore broken links from sitemap-ref.html")
>
> I was thinking of some sort of nesting notation for indicating links from
> a certain page:
>
> <!-- Ignore *.xml links from sitemap-ref.* -->
> <ignore from-uri="sitemap-ref.*">
> <uri>*.xml</uri>
> </ignore>
*************************************
> Sorry I don't have any answers or even particularly coherent questions ;)

Neither have I!

> I have the feeling that cli.xconf's job, mapping URIs to the filesystem,
> could potentially be quite intricate.
> It is roughly an inverse of what
> the sitemap does. Perhaps we need an analogous syntax?

Perhaps. I think we've only just started trying to work out what is possible here. I'd be pleased to carry on the conversation, as what we have at the moment is purely what I thought best, and not the result of much community discussion.

There's a lot we could discuss here. For example, how do we handle the situation where we want to crawl a number of pages, but don't want to have to repeat the destination for each of them? I think we could come up with an elegant configuration for this. My <uri> thing is only the beginning.

The first thing to do is to start identifying the possible use cases for URI mappings, so that we can see the range of the problem we're trying to solve (and take it beyond the scope of just fixing my own problems!).

I have said previously that the Bean interface should be declared alpha/unstable. By the sounds of it we need to declare the xconf structure unstable too. See separate thread!

Regards, Upayavira
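
For anyone trying to follow the append/replace/insert arithmetic discussed above, here is a minimal sketch in Java. It is only an illustration, not the CLI's actual code: the class and method names (UriArithmetic, sourceFor, destinationFor) are made up, and it assumes that src-prefix is used only when building the source URI, and that 'insert' substitutes the src as-is for the first * in dest.

```java
public class UriArithmetic {

    /** Hypothetical helper: the URI requested from Cocoon is src-prefix + src. */
    public static String sourceFor(String srcPrefix, String src) {
        return srcPrefix + src;
    }

    /**
     * Hypothetical helper: where the generated page is written.
     * The src-prefix never appears here, which is the point of splitting it from src.
     */
    public static String destinationFor(String type, String src, String dest) {
        if ("append".equals(type)) {
            // dest is treated as a base; the page's src is tacked onto the end
            return dest + src;
        }
        if ("replace".equals(type)) {
            // src is ignored entirely; dest is the complete destination
            return dest;
        }
        if ("insert".equals(type)) {
            // assumption: src replaces the first '*' in dest, unmodified
            int star = dest.indexOf('*');
            if (star < 0) {
                throw new IllegalArgumentException("insert needs a * in dest: " + dest);
            }
            return dest.substring(0, star) + src + dest.substring(star + 1);
        }
        throw new IllegalArgumentException("unknown type: " + type);
    }

    public static void main(String[] args) {
        System.out.println(sourceFor("documents/", "index.html"));
        System.out.println(destinationFor("append", "index.html", "build/dest/"));
        System.out.println(destinationFor("append", "news.html", "build/dest/"));
        System.out.println(destinationFor("replace", "index.html", "build/dest/docs.html"));
    }
}
```

With 'append', a crawled link to news.html lands next to index.html under build/dest/ without any extra configuration, whereas 'replace' only makes sense for a single page with no crawling.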