Hi,

Nicola Ken Barozzi wrote:
> Of all these discussions, one thing sticks out: we must
> rewrite/fix/enhance/whatever the Cocoon crawler.
>
> Reasons:
> - speed
> - correct link gathering
> but mostly
> - speed
>
> Why is it so slow? Mostly because it generates each source three times:
> * to get the links
> * for each link, to get the MIME type
> * to get the page itself
>
> To do this it uses two environments, the FileSavingEnvironment and the
> LinkSamplingEnvironment.
> <snip/>
> I've taken a look at the crawler project in the Lucene sandbox, but its
> objectives are totally different from ours. We could in the future add a
> plugin to it to index a Cocoon site using the link view, but it does
> indexing, not saving a site locally. So our best option is to do the work
> in Cocoon.
> <snip/>
> The three calls to Cocoon can be reduced quite easily to two, by making
> the call to the FileSavingEnvironment return both the content and the
> MIME type at the same time, or by caching the result as the proposed Ant
> task in the Cocoon scratchpad does.
Yup.
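To make the "return both things at the same time" idea concrete, here is a minimal sketch of a combined result object; the class and its names are purely hypothetical, not existing Cocoon API:

    /**
     * Hypothetical value object for one processing pass: if the
     * FileSavingEnvironment call handed back both of these together,
     * the separate MIME-type pass per link would disappear.
     */
    public class ProcessingResult {
        private final byte[] content;   // the serialized page, as saved to disk
        private final String mimeType;  // e.g. "text/html", needed for the file extension

        public ProcessingResult(byte[] content, String mimeType) {
            this.content = content;
            this.mimeType = mimeType;
        }

        public byte[] getContent() { return content; }
        public String getMimeType() { return mimeType; }
    }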
> The problem arises with the LinkSamplingEnvironment, because it uses a
> Cocoon view to get the links. Thus we need to ask Cocoon two things: the
> links and the contents.
<big snip/>
Ask Cocoon two things, or make a Generator/Transformer that does the two things in one go.
I'm currently playing around with a SourceLinkStatusGenerator, which is like the StatusGenerator but does not request the links of a page via an http: call; it uses a processor.process() call instead. It works recursively: you ask the SourceLinkStatusGenerator for all outbound links of index.html, and it returns an XML document with the links of all pages reachable from index.html.
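To show just the recursive part: below is a minimal sketch of the breadth-first link gathering, with the hypothetical linksOf() hook standing in for the real processor.process() call through a link-sampling environment (all the actual Cocoon plumbing is left out):

    import java.util.Collection;
    import java.util.Iterator;
    import java.util.LinkedHashSet;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Set;

    /** Sketch of the recursive gathering behind a SourceLinkStatusGenerator. */
    public abstract class LinkHarvester {

        /**
         * Hypothetical hook: in the real generator this would run the pipeline
         * via processor.process() with a link-sampling environment and return
         * the links (Strings) found in the given page.
         */
        protected abstract Collection linksOf(String uri) throws Exception;

        /** Returns the start page and every page reachable from it, each visited once. */
        public Collection harvest(String startUri) throws Exception {
            Set seen = new LinkedHashSet();
            List queue = new LinkedList();
            queue.add(startUri);
            while (!queue.isEmpty()) {
                String uri = (String) queue.remove(0);
                if (!seen.add(uri)) {
                    continue; // already visited, avoids cycles between pages
                }
                for (Iterator i = linksOf(uri).iterator(); i.hasNext();) {
                    String link = (String) i.next();
                    if (!seen.contains(link)) {
                        queue.add(link);
                    }
                }
            }
            return seen;
        }
    }

In the generator itself, this set would then be written out as SAX events to form the XML document with all the links.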
You ask Cocoon: give me the content of page index.html plus its outbound links. The only problem I see is that when you ask Cocoon this question you will not get a plain text/html response, but a combined text/html + application/x-cocoon-links response (taking the index.html example from above).
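Purely to illustrate the problem, a combined response would have to look something like this invented wrapper (not a format Cocoon actually produces), which is exactly what makes the content-type ambiguous:

    <!-- hypothetical combined response for index.html: it is neither plain
         text/html nor a plain application/x-cocoon-links list on its own -->
    <crawl-result>
      <content mime-type="text/html">
        <html>...the index.html page itself...</html>
      </content>
      <links mime-type="application/x-cocoon-links">
        <link href="news.html"/>
        <link href="about.html"/>
      </links>
    </crawl-result>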
Moreover, you might have to adapt the sitemap, i.e. add something like <map:match pattern="crawling"> and ask Cocoon the right question within this map:match.
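For instance, a dedicated match could reuse the existing link-status step; a minimal sketch, with the pattern and source paths invented for illustration:

    <!-- hypothetical crawler entry point in the sitemap -->
    <map:match pattern="crawling/**">
      <map:generate src="content/{1}.xml"/>
      <map:transform type="xslt" src="linkstatus.xslt"/>
      <map:serialize type="xml"/>
    </map:match>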
Hmm, if you rely on links, you might want a LinkTransformer that does not throw away the page content but harvests the links non-destructively.
Hmm, that would be the best: no big sitemap changes, just another transforming step, i.e. the new LinkAndContentTransformer step instead of type="xslt" src="linkstatus.xslt". But the content-type issue stays.
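To sketch the non-destructive harvesting itself: here is the core of such a transformer, written against plain SAX (an XMLFilterImpl rather than Cocoon's real transformer API, with the attribute handling simplified):

    import java.util.Collection;
    import java.util.LinkedHashSet;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.XMLFilterImpl;

    /**
     * Sketch of a LinkAndContentTransformer: every SAX event is forwarded
     * unchanged, so the page content survives, while link targets are
     * recorded on the side as the events stream past.
     */
    public class LinkAndContentFilter extends XMLFilterImpl {

        private final Collection links = new LinkedHashSet();

        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) throws SAXException {
            // simplified: real link gathering would be namespace- and element-aware
            String target = atts.getValue("href");
            if (target == null) {
                target = atts.getValue("src");
            }
            if (target != null) {
                links.add(target);
            }
            super.startElement(uri, localName, qName, atts); // content passes through untouched
        }

        public Collection getLinks() {
            return links;
        }
    }

This way the pipeline output stays exactly what it was, and the crawler gets the links for free from the pass it has to make anyway; only the question of how to hand the gathered links back (the content-type issue above) remains.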
btw, thanks for starting this RT. I didn't have the passion to initiate it, but it is necessary, and I appreciate it.
bye bernhard