On Mon, 03-04-2006 at 12:34 +0100, Upayavira wrote:
> Thorsten Scherler wrote:
> > On Mon, 03-04-2006 at 09:00 +0100, Upayavira wrote:
> >> David Crossley wrote:
> >>> Upayavira wrote:
> >>>> Sylvain Wallez wrote:
> >>>>> Carsten Ziegeler wrote:
> >>>>>> Sylvain Wallez wrote:
> >>>>>>> Hmm... the current CLI uses Cocoon's links view to crawl the
> >>>>>>> website. So although the new crawler can be based on servlets,
> >>>>>>> it will assume these servlets answer to a ?cocoon-view=links
> >>>>>>> :-)
> >>>>>>
> >>>>>> Hmm, I think we don't need the links view in this case anymore.
> >>>>>> A simple HTML crawler should be enough, as it will follow all
> >>>>>> links on the page. The view would only make sense where you
> >>>>>> don't output HTML, so the usual crawler tools would not work.
> >>>>>
> >>>>> In the case of Forrest, you're probably right. But the links
> >>>>> view also allows links to be followed in pipelines producing
> >>>>> something that's not HTML, such as PDF, SVG, WML, etc.
> >>>>>
> >>>>> We have to decide if we want to lose this feature.
> >>>
> >>> I am not sure if we use this in Forrest. If not, then we probably
> >>> should be.
> >>>
> >>>> In my view, the whole idea of crawling (i.e. gathering links from
> >>>> pages) is suboptimal anyway. For example, some sites don't
> >>>> directly link to all pages (e.g. they are accessed via
> >>>> JavaScript, or whatever), so pages get missed.
> >>>>
> >>>> Were I to code a new CLI, whilst I would support crawling, I
> >>>> would mainly configure the CLI to get the list of pages to visit
> >>>> by calling one or more URLs. Those URLs would specify the pages
> >>>> to generate.
> >>>>
> >>>> Thus, Forrest would transform its site.xml file into this list of
> >>>> pages, and drive the CLI via that.
> >>>
> >>> This is what we already do. We have a property
> >>> "start-uri=linkmap.html"
> >>> http://forrest.zones.apache.org/ft/build/cocoon-docs/linkmap.html
> >>> (we actually use the corresponding XML, of course).
> >>>
> >>> We define a few extra URIs in the Cocoon cli.xconf.
> >>>
> >>> There are issues, of course. Sometimes we want to include
> >>> directories of files that are not referenced in the site.xml
> >>> navigation. For my sites I just use a DirectoryGenerator to build
> >>> an index page which feeds the crawler. Sometimes that technique is
> >>> not sufficient.
> >>>
> >>> We also gather links from text files (e.g. CSS) using Chaperon.
> >>> This works nicely but introduces some overhead.
> >>
> >> This more or less confirms my suggested approach - allow crawling
> >> at the 'end-point' HTML, but more importantly, use a page/URL to
> >> identify the pages to be crawled. The interesting thing from what
> >> you say is that this page could itself be nothing more than HTML.
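(As a minimal sketch of that seed-page approach: fetch a start URI
such as the linkmap rendered as HTML, then follow plain href links
from there. The class name, seed URI and same-site check below are
made up for illustration; this is not the actual CLI code.)

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeedCrawler {

    // Hypothetical seed page, e.g. the linkmap rendered as HTML.
    private static final String SEED = "http://localhost:8888/linkmap.html";
    private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']");

    public static void main(String[] args) throws Exception {
        String site = SEED.substring(0, SEED.lastIndexOf('/') + 1);
        Deque<String> queue = new ArrayDeque<String>();
        Set<String> seen = new HashSet<String>();
        queue.add(SEED);
        while (!queue.isEmpty()) {
            String uri = queue.poll();
            if (!seen.add(uri)) {
                continue;                       // already visited
            }
            // A real CLI would hand the URI to Cocoon for rendering
            // and check the content type before looking for links.
            String page = fetch(uri);
            System.out.println("rendered: " + uri);
            Matcher m = HREF.matcher(page);
            while (m.find()) {
                // resolve relative links against the current page
                String link = new URL(new URL(uri), m.group(1)).toString();
                if (link.startsWith(site)) {
                    queue.add(link);            // stay inside the site
                }
            }
        }
    }

    private static String fetch(String uri) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(uri).openStream(), "UTF-8"));
        StringBuilder sb = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } finally {
            in.close();
        }
        return sb.toString();
    }
}

(Seeding the queue from a generated page like the linkmap is what
makes this work without a links view: anything reachable as plain
HTML gets rendered, whether or not the site's real pages link to it.)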
> >
> > Well, yes and not really, since e.g. Chaperon is text-based, not
> > markup. You need a lex-writer to generate links for the crawler.
>
> Yes. You misunderstand me, I think.

Yes, sorry, I did misunderstand you.

> Even if you use Chaperon etc. to parse markup, there'd be no
> difficulty expressing the links that you found as an HTML page - one
> intended to be consumed by the CLI, not to be publicly viewed.

Well, in the case of CSS you want them publicly viewable as well, but
I got your point. ;)

> In fact, if it were written to disc, Forrest would probably delete it
> afterwards.
>
> > Forrest actually is *not* aimed at HTML-only support, and one can
> > think of the situation where you want your site to be only text
> > (kind of a book). Here you need to crawl the lex-rewriter outcome
> > and follow the links.
>
> Hopefully I've shown that I had understood that already :-)

yeah ;)

> > The current limitations of Forrest regarding the crawler are IMO
> > not caused by the crawler design but rather by our (as in Forrest)
> > usage of it.
>
> Yep, fair enough. But if the CLI is going to survive the shift that
> is happening in Cocoon trunk, something big needs to be done by
> someone. It cannot survive in its current form, as the code it uses
> is changing almost beyond recognition.
>
> Heh, perhaps the Cocoon CLI should just be a Maven plugin.

...or a Forrest plugin. ;) That would make it possible for Cocoon,
Lenya and Forrest committers to help. Kind of like
http://svn.apache.org/viewcvs.cgi/lenya/sandbox/doco/ ;)

salu2
-- 
thorsten

"Together we stand, divided we fall!" Hey you (Pink Floyd)
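P.S. For the Chaperon/CSS point above, a rough sketch of expressing
gathered links as a throwaway HTML page for the CLI to crawl. The
file names and the regex are made up for the example - Chaperon
itself uses a proper grammar, not a regex:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssLinkPage {

    // Matches url(...) references in CSS; quotes are optional.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*['\"]?([^'\")]+?)['\"]?\\s*\\)");

    public static void main(String[] args) throws Exception {
        // Hypothetical input stylesheet.
        String css = new String(
                Files.readAllBytes(Paths.get("skin/basic.css")), "UTF-8");

        // Emit every reference as an anchor on a throwaway page.
        StringBuilder html = new StringBuilder("<html><body>\n");
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            html.append("<a href=\"").append(m.group(1)).append("\">.</a>\n");
        }
        html.append("</body></html>\n");

        // Written to disc only so the CLI can pick it up; Forrest
        // could delete it again after the crawl, as suggested above.
        Files.createDirectories(Paths.get("build/tmp"));
        Files.write(Paths.get("build/tmp/css-links.html"),
                html.toString().getBytes("UTF-8"));
    }
}

Once the references are on an ordinary HTML page, the plain crawler
handles them like any other link, which is the whole point: no links
view needed, even for non-markup sources.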