Re: [RT] New Cocoon Site Crawler Environment

Vadim Gritsenko Tue, 17 Dec 2002 19:03:31 -0800

Nicola Ken Barozzi wrote:

Vadim Gritsenko wrote:

Nicola Ken Barozzi wrote:
...

Why is it so slow?
Mostly because it generates each source three times.

[...]

Thus after the call we would have in the environment the result, the type and the links, all in one call.

Type and links - yes, I agree. Content - no, we won't get correct content, because the links in it will not have been translated. And the produced content cannot be "re-linked" afterwards, because it can be in any binary format that supports links (MS Excel, PDF, MS Word, ...)

Ok, you are correct.

Please add here the results we have come to in our fast AIM discussion, I have to run now.

Ok, here is the thing. It is possible to get everything in one call (and - this remark goes to Berin - without an increase in resource consumption) if we (re)move the translateURI functionality from Main. The problem is that the getType() method is used for only one purpose: to decide on a good name for the resulting file, that is, a good extension according to the MIMEUtils settings. The other problem is that getLinks is used only to collect this information (about good names) and deliver it to the LinkTranslator transformer, which does the actual work of replacing links.

So, if we remove link translation from Main.java, where can it go and how should it be done? There are several options.

1) Do not change names.
This works for everything except URIs ending with "/" - and for such URIs we can use the existing solution: append Constants.INDEX_URI to the end.
Points in favor of this method:
* The generated site will be close to the live site with regard to file names.
* In Main.java, only one call is needed.
2) Change names according to a translation table supplied to Main by the user.
This solution provides some flexibility (maybe too much of it).
Points in favor of this method:
* Flexibility.
* Same as above.
3) Change names as we did before - by utilizing MIMEUtils.
Points in favor of this method:
* It is the backward compatible way.
* We still have to know the types of all links to do the translation, which means an extra getType() call on every link (excluding duplicates - the information is cached). Hm, this one, actually, is not in favor...
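
To make the three options concrete, here is a rough sketch of each as a plain function. This is not Cocoon code: the class and method names, the local INDEX_URI constant (standing in for Constants.INDEX_URI) and the extension map (standing in for MIMEUtils) are all assumptions made for illustration.

```java
import java.util.Map;

public class NameStrategies {
    // Assumption: stands in for Cocoon's Constants.INDEX_URI.
    static final String INDEX_URI = "index";

    // Option 1: keep the name, only completing URIs that end with "/".
    static String keepName(String uri) {
        return uri.endsWith("/") ? uri + INDEX_URI : uri;
    }

    // Option 2: look the URI up in a user-supplied table, else fall back to option 1.
    static String translateByTable(String uri, Map<String, String> table) {
        String mapped = table.get(uri);
        return mapped != null ? mapped : keepName(uri);
    }

    // Option 3: choose the extension from the content type.
    // The caller must already know mimeType, which is the extra getType() call per link.
    // 'extensions' is a hypothetical stand-in for the MIMEUtils lookup.
    static String translateByType(String uri, String mimeType, Map<String, String> extensions) {
        String name = keepName(uri);
        String ext = extensions.get(mimeType);
        if (ext == null || name.endsWith(ext)) {
            return name;
        }
        return name + ext;
    }
}
```

Only the third function needs the type of the target, which is why it is the only option that cannot do without a second call per link.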

And this name translation can happen in the LinkTranslator transformer, which currently does the link translation magic. If we move all the URI translation logic there, whatever it turns out to be (see options 1-3 above), it will be possible to implement Main in one step instead of three.

The exception is case (3), where complexity will be added to LinkTranslator, but we will still reduce the number of calls from 3 (per link) to 2.
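
As a rough illustration of what "one step" could look like, here is a sketch of an iterative crawl loop that calls each URI exactly once. It is not the actual Main: the 'process' function is a hypothetical stand-in for a single pipeline call that writes the page (with links already translated by the transformer) and hands back the links it found.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class OnePassCrawl {
    // Processes every reachable URI once and returns the set of URIs visited.
    // 'process' is hypothetical: one call per URI, returning the links of that page.
    static Set<String> crawl(String start, Function<String, List<String>> process) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        seen.add(start);
        pending.add(start);
        while (!pending.isEmpty()) {
            String uri = pending.poll();
            // Single call per URI: content is written and links come back together.
            for (String link : process.apply(uri)) {
                // Queue a link only the first time it is seen.
                if (seen.add(link)) {
                    pending.add(link);
                }
            }
        }
        return seen;
    }
}
```

The loop holds only the set of known URIs and the queue of pending ones, so it stays iterative rather than nesting one call inside another.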

Thanks :-)


You are welcome. I hope I told the story clearly enough.

But there is hope of getting it all at once - if LinkSamplingTransformer is also a LinkTranslatingTransformer and calls Main back on every new link (recursive processing, as opposed to the iterative processing in the current implementation of Main). The drawback of the recursive approach is increased memory consumption.
NAO = not an option


Yes, it was a totally wrong idea on my part.

It doesn't scale, you are right.


And it never did. Amen.

Vadim


