On Sat, 2 Aug 2003 22:08:21 +1000, "Jeff Turner" <[EMAIL PROTECTED]> said:
> Hi,
> 
> I'm tinkering around with the CLI, thinking how to add
> don't-crawl-this-page support, and have some questions on how cli.xconf
> currently works.  The following block in cli.xconf has me confused..

Jeff. Great to see you're engaging with it!

I have also been working on the CLI. I've spent my week's spare time
completely reworking it. I'll post separately about what I've been up to,
but basically the whole thing should be much easier to understand, with a
separate crawler class, a separate class for handling Cocoon
initialisation, and another for handling URI arithmetic (which you're
talking about below). As to adding exclusions, I think it should merely
be a question of identifying the syntax. The rest, with my new code,
should be pretty easy (e.g. telling the crawler what to ignore with a set
of wildcard parameters).

I haven't been able to debug this, as my copy of Eclipse insists on
entering Java's Classloader code when I try to debug it. When I've worked
out how to stop Eclipse doing that, I'll get it debugged, and put it into
the scratchpad. 

When I've got this going, I'm going to convert the xconf code to use a
Configuration object, and then write an Ant task that does the same
ProcessXConf work, so that you can have the xconf configuration directly
in your Ant script. This Ant task will be a simple wrapper around the
bean, and should be pretty trivial.
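
Just to sketch the sort of thing I mean (every name below is made up at
this point - only the bean and the xconf structure it would wrap are
real), the Ant side might end up looking something like:

   <target name="docs">
     <!-- hypothetical task wrapping the bean; task and attribute names
          are illustrative only -->
     <cocoon-cli dest-dir="build/dest">
       <uri src="index.html"/>
     </cocoon-cli>
   </target>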

I have also, I think, just sorted my problem with my caching code not
working. Basically, the Cocoon cache is transient, so it is lost every
time Cocoon starts; and Cocoon is started every time the CLI runs. So if
we want the CLI to generate only new pages based upon the cache, we've
got to make the cache for the CLI persistent. Again, see
separate thread.

>   |  The old behaviour - appends uri to the specified destination
>   |  directory (as specified in <dest-dir>):
>   |
>   |   <uri>documents/index.html</uri>
> 
> Do we still want this <uri>...</uri> behaviour?  Currently the CLI only
> accepts <uri src="..."/>.  

I think someone (Joerg?) fixed a bug that might also have disabled the
old behaviour. I would be happy to let it go, but its benefit is where
you have a lot of pages that share a destination. Otherwise you'd have to
repeat the destination URI for each page.
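
That is, with the old syntax a whole set of pages could share the one
<dest-dir>, roughly like this (the second page is just an illustration):

   <dest-dir>build/dest</dest-dir>
   <uri>documents/index.html</uri>
   <uri>documents/news.html</uri>

whereas with <uri src="..." dest="..."/> you end up repeating the same
dest on every element.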

> Come to think of it, the attribute name 'src'
> doesn't really make sense.  What is the "source" of a Cocoon URI?  It
> would be the XML (documents/index.xml), which is not what we're
> specifying in @src.

It is the source for a source/destination pair. You could see it as a
cocoon: protocol source (almost). Would you suggest something different?
 
>   |  Append: append the generated page's URI to the end of the 
>   |  source URI:
>   |
>   |   <uri type="append" src-prefix="documents/" src="index.html"
>   |   dest="build/dest/"/>
> 
> What is a 'source URI' here, and why would we want to append another URI
> (URIs are not additive)?  Does this mean documents/index.html would be
> written to build/dest/?  If so, why separate @src-prefix and @src?

This is what I've started calling (after Bernard) URI Arithmetic:
different ways of calculating your destination URI from your source URI.

I have to say, I haven't yet found the best language for explaining this,
so please do bear with me.

Let's take the example of Cocoon documentation. The Cocoon URI is
documents/index.html. We want the URI of the file produced to be
build/dest/index.html. So we don't want 'documents' in the destination
URI. But we need it in the source URI. So we therefore use this as the
src-prefix, i.e. it is included in the source URI, but excluded from the
destination URI.

Now, why have 'append', 'replace', etc? Well, sometimes you will want to
append the source URI to the destination URI - in our case appending
'index.html' to 'build/dest/' gives 'build/dest/index.html', which is
what we want. But also, if we crawl on to news.html, adding that to
'build/dest/' will give us 'build/dest/news.html', which again is what we
want.
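
In other words, taking the example entry from the docs above, the
mappings work out as:

   <uri type="append" src-prefix="documents/" src="index.html"
        dest="build/dest/"/>

   documents/index.html  ->  build/dest/index.html
   news.html (crawled)   ->  build/dest/news.html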

However, a scenario I have is where no crawling is taking place, and
there is no relationship between the source and destination URIs. So for
example: /site/page1.html could be saved as /foobar/client1.html. In that
scenario one would use 'REPLACE' as the type.
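
In xconf terms that might look something like this (a sketch only; as I
say below, the src-prefix isn't really needed for 'replace'):

   <uri type="replace" src="/site/page1.html" dest="/foobar/client1.html"/>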

>   |  Replace: Completely ignore the generated page's URI - just 
>   |  use the destination URI:
>   |
>   |   <uri type="replace" src-prefix="documents/" src="index.html" 
>   |   dest="build/dest/docs.html"/>
> 
> Sounds fine, but again, since we know the whole URI
> (documents/index.html), why separate into @src-prefix and @src?

In this scenario, the src-prefix isn't really needed, as the src is
ignored when calculating the destination URI.
 
>   |  Insert: Insert generated page's URI into the destination 
>   |  URI at the point marked with a * (example uses fictional 
>   |  zip protocol)
>   |
>   |   <uri type="insert" src-prefix="documents/" src="index.html" 
>   |   dest="zip://*.zip/page.html"/>
> 
> Leaves me very confused.. what would be the result here?  An index.zip
> file, containing the bytes from documents/index.html saved as page.html?
> Is there a non-fictional scenario where this makes more sense? :)

Fraid there isn't a non-fictional one ATM. This one was put there simply
for completeness (only took minutes to implement). To my mind, it is
append and replace that are the most important features.

> Anyway, on to the subject of excluding certain URIs.. are there any
> preferred ways of doing it?  I've currently got:
> 
>   <ignore-uri>....</ignore-uri>
> 
> working, which seems crude but effective.  Ideally I'd like to:
>  - Use wildcards ("don't crawl '*.xml' URLs")
>  - be able to exclude links based on which page they originate from
>    ("ignore broken links from sitemap-ref.html")
> 
> I was thinking of some sort of nesting notation for indicating links from
> a certain page:
> 
>   <!-- Ignore *.xml links from sitemap-ref.* -->
>   <ignore from-uri="sitemap-ref.*"> 
>       <uri>*.xml</uri>   
>   </ignore>

*************************************


> Sorry I don't have any answers or even particularly coherent questions ;)

Neither have I!

> I have the feeling that cli.xconf's job, mapping URIs to the filesystem,
> could potentially be quite intricate.  It is roughly an inverse of what
> the sitemap does.  Perhaps we need an analogous syntax?

Perhaps. I think we've only just started trying to work out what is
possible here. I'd be pleased to carry on the conversation, as what we
have at the moment is purely what I thought best, and not the result of
much community discussion.

There's a lot we could discuss here. For example, how do we handle the
situation where we want to crawl a number of pages, but don't want to
have to repeat the destination for each of them? I think we could come up
with an elegant configuration for this. My <uri> thing is only the
beginning. 
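
Just to throw one possibility out there (purely illustrative - none of
this exists anywhere yet): perhaps <uri> elements could be grouped under
a shared destination, something like:

   <uris type="append" src-prefix="documents/" dest="build/dest/">
     <uri src="index.html"/>
     <uri src="news.html"/>
   </uris>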

The first thing to do is to start identifying the possible use cases for
URI mappings, so that we can see the range of the problem we're trying to
solve (and take it beyond the scope of just fixing my problems!).

I have said previously that the Bean interface should be declared
alpha/unstable. By the sounds of it, we need to declare the xconf
structure unstable too. See separate thread!

Regards, Upayavira
