Of all these discussions, one thing sticks out: we must rewrite/fix/enhance/whatever the Cocoon crawler.

Reasons:

- speed
- correct link gathering

but mostly

- speed

Why is it so slow?
Mostly because it generates each source three times:

* to get the links
* for each link, to get its MIME type
* to get the page itself

To do this, it uses two environments: the FileSavingEnvironment and the LinkSamplingEnvironment.
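
To make the cost concrete, here is a rough sketch of the current per-page flow; the helper methods are illustrative stand-ins for the real calls made through the two environments, not actual Cocoon API:

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    abstract class CrawlerSketch {

        /** Saving one page currently costs 2 + links.size() pipeline runs. */
        void crawlPage(String uri) throws Exception {
            // Run 1: process the page through its link view to gather links
            Collection links = sampleLinks(uri);

            // Runs 2..n+1: process every link once more, only to learn its
            // MIME type so the saved file can get the right extension
            Map types = new HashMap();
            for (Iterator i = links.iterator(); i.hasNext();) {
                String link = (String) i.next();
                types.put(link, getMimeType(link));
            }

            // Final run: process the page yet again to save its content
            savePage(uri, types);
        }

        abstract Collection sampleLinks(String uri) throws Exception;   // LinkSamplingEnvironment
        abstract String getMimeType(String uri) throws Exception;       // full processing, type only
        abstract void savePage(String uri, Map types) throws Exception; // FileSavingEnvironment
    }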




I've taken a look at the crawler project in the Lucene sandbox, but its objectives are totally different from ours. In the future we could add a plugin to it to index a Cocoon site using the link view, but it does indexing, not saving a site locally.
So our option is to do the work in Cocoon.




The three calls to Cocoon can be reduced quite easily to two, by making the call through the FileSavingEnvironment return both the content and its MIME type at the same time and using those, or by caching the results, as the proposed Ant task in the Cocoon scratchpad does.
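
One way to picture the caching variant (saveAndGetType() is a hypothetical helper that does one FileSavingEnvironment run and reports the resulting content type):

    import java.util.HashMap;
    import java.util.Map;

    abstract class CachingSketch {

        /** Content types of URIs already processed, so no URI is run twice. */
        private final Map knownTypes = new HashMap();

        String typeOf(String uri) throws Exception {
            String type = (String) knownTypes.get(uri);
            if (type == null) {
                // First time we see this URI: save it and remember its type
                type = saveAndGetType(uri);
                knownTypes.put(uri, type);
            }
            // Later pages linking to the same URI reuse the cached type
            return type;
        }

        abstract String saveAndGetType(String uri) throws Exception;
    }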

The problem arises with the LinkSamplingEnvironment, because it uses a Cocoon view to get the links. Thus we need to ask Cocoon for two things: the links and the content.

Let's leave the view concept aside for now, and think about how to sample links from the content as it is being produced.

We can use a LinkSamplingPipeline.
Yes, a pipeline that introduces a connector just after the "content"-tagged sitemap component and saves the links it finds into the environment.

Thus, with a single call, we would have the result, the content type, and the links all in the environment.
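
As a rough idea, here is what such a connector could look like, written as a plain SAX filter to keep the sketch self-contained; in Cocoon it would be a Transformer that the LinkSamplingPipeline splices in, storing the links in the Environment instead of a local list, and the attribute handling below is illustrative, not the real link view semantics:

    import java.util.ArrayList;
    import java.util.List;

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.XMLFilterImpl;

    /**
     * Sketch of the "connector": it forwards every SAX event untouched,
     * so the main pipeline keeps producing the page, and merely records
     * what looks like a link on the way through.
     */
    public class LinkGatheringFilter extends XMLFilterImpl {

        private final List links = new ArrayList();

        /** Links seen so far; the real thing would put them in the Environment. */
        public List getLinks() {
            return links;
        }

        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            String href = atts.getValue("href");
            if (href == null) {
                href = atts.getValue("src");
            }
            if (href != null) {
                links.add(href);
            }
            // Pass the event on unchanged: sampling must not block the pipeline
            super.startElement(uri, localName, qName, atts);
        }
    }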

In essence, we are creating a non-blocking view that runs in parallel with the main pipeline and reports its results to the environment.

This is how views are managed in the interpreted sitemap, in a transformer:


    // Check view
    if (this.views != null) {

        // Inform the pipeline that we have a branch point
        context.getProcessingPipeline().informBranchPoint();

        String cocoonView = env.getView();
        if (cocoonView != null) {

            // Get view node
            ProcessingNode viewNode =
                (ProcessingNode) this.views.get(cocoonView);

            if (viewNode != null) {
                if (getLogger().isInfoEnabled()) {
                    getLogger().info("Jumping to view "
                        + cocoonView + " from transformer at "
                        + this.getLocation());
                }
                // Abandon the normal pipeline and process the view instead
                return viewNode.invoke(env, context);
            }
        }
    }

    // Return false to continue sitemap invocation
    return false;
}

It effectively branches and continues only with the view.

Wait, this means that when the CLI recreates a site it doesn't save the views, right?
Correct, views are simply ignored by the CLI and not created on disk. This is also due to how views are invoked in Cocoon, with a request parameter (?cocoon-view=...), so they cannot be saved to disk under a correct URL.
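
For instance, the link view of index.html is requested as index.html?cocoon-view=links: strip the query string to build a file name and it collides with the saved index.html itself, so there is no sensible place on disk for it.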

But even if I don't save it, I may need it for internal Cocoon processing, as is the case with the crawler.

I don't know if it's best to use a special pipeline, to cache the views, or what, but we need to find a solution.

Any idea?

--
Nicola Ken Barozzi [EMAIL PROTECTED]
- verba volant, scripta manent -
(discussions get forgotten, just code remains)