Jérôme,

A mail archive is an amazing source of information, isn't it?! :-)
To answer your question: just ask yourself how many pages per second you plan to fetch and parse, and how many queries per second a Lucene index can handle and deliver in the UI.
Here I see something like 200+ pages per second against a maximum of 20 queries per second.
http://wiki.apache.org/nutch/HardwareRequirements
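The back-of-the-envelope math above can be sketched roughly like this. The numbers are just the figures from this thread (on the order of 200 pages/second fetched and parsed, at most ~20 queries/second per Lucene index); the function names are illustrative, not Nutch APIs.

```python
# Capacity sketch, assuming the rough figures from this thread:
# ~200 pages/sec crawl throughput, ~20 queries/sec per Lucene index.
import math

def search_servers_needed(peak_qps: float, qps_per_index: float = 20.0) -> int:
    """Minimum number of index/search servers to sustain peak_qps."""
    return math.ceil(peak_qps / qps_per_index)

def fetch_days(pages: int, pages_per_second: float = 200.0) -> float:
    """Days needed to fetch `pages` at a sustained crawl rate."""
    return pages / pages_per_second / 86_400

if __name__ == "__main__":
    print(search_servers_needed(100))      # 5 servers for a 100 qps UI load
    print(round(fetch_days(100_000_000)))  # ~6 days to fetch 100M pages
```

The point of the exercise: crawl throughput and query throughput scale independently, so the UI-facing query capacity, not the fetcher, usually dictates how many search servers you need.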

Speed improvements in the UI can be made by caching the components you use to assemble it ("There are some ways to improve speed"). But seriously, I don't think any pages will contain 'cacheable' items until parsing. Over the last years there is one thing I have noticed that matters in a search engine - minimalism. Nutch uses no logging library, no RMI, and no meta data in the web db. Why?
Minimalism.
Minimalism == speed, speed == scalability, scalability == serious enterprise search engine projects.

I don't think it would be a good move to slow down HTML parsing (the most used parser) just to make writing RSS parsers easier for developers.
BTW, we already have an HTML parser and a feed parser that work, as far as I know.
I guess 90% of Nutch users use the HTML parser and only 10% the feed parser (since blogs are mostly HTML as well).

From my perspective we have much more general things to solve in Nutch (manageability, monitoring, NDFS block-based task routing, more dynamic search servers) than improving things we already have. Anyway, as you may know, we have a plugin system, and one of its goals is to give developers the freedom to develop custom plugins. :-)

Cheers,
Stefan
B-)

P.S. Do you think it makes sense to run another public Nutch mailing list, since 'THE nutch [...] mailing list is nutch-[EMAIL PROTECTED]'? 'Isn't it?'
http://www.mail-archive.com/[email protected]/msg01513.html



Am 24.11.2005 um 19:28 schrieb Jérôme Charron:

Hi Stefan,

And thanks for taking the time to read the doc and give us your feedback.

-1!
XSL is terribly slow!
XML will blow up memory and storage usage.

But there is still something I don't understand...
Regarding a previous discussion we had about using the OpenSearch API to
replace Servlet => HTML with Servlet => XML => HTML (using XSL),
here is a copy of one of my comments:

In my opinion, it is the "dream" front-end architecture. But more
pragmatically, I'm not sure it's a good idea. XSL transformation is a
rather slow process!! And the Nutch front-end must be very responsive.

and then your response, and Doug's response too:
Stefan:
We already did experiments using XSLT.
There are some ways to improve speed; however, it is 20+% slower than JSP.
Doug:
I don't think this would make a significant impact on overall Nutch search
performance.
(the complete thread is available at
http://www.mail-archive.com/[email protected]/msg03811.html)

I'm a little bit confused... why must the use of XSL be considered too
time- and memory-expensive in the back-end process,
but not in the front-end?
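To make the two pipelines under discussion concrete, here is a minimal sketch. Python's standard library has no XSLT engine, so this only contrasts direct templating (Servlet => HTML) with the extra intermediate XML step (Servlet => XML => HTML); the real cost measured in the quoted thread also includes the XSL transformation itself. All function names here are illustrative.

```python
# Sketch of the two front-end pipelines: direct rendering vs. an
# intermediate XML document that the view layer serializes and re-parses.
# (The XSLT step itself is not shown; Python's stdlib has no XSLT engine.)
import xml.etree.ElementTree as ET

def render_direct(title: str, hits: list) -> str:
    """Servlet => HTML: format the result page directly."""
    items = "".join(f"<li>{h}</li>" for h in hits)
    return f"<html><h1>{title}</h1><ul>{items}</ul></html>"

def render_via_xml(title: str, hits: list) -> str:
    """Servlet => XML => HTML: build an intermediate XML result document
    (e.g. OpenSearch-style), serialize it, then re-parse it in the view."""
    root = ET.Element("results", title=title)
    for h in hits:
        ET.SubElement(root, "hit").text = h
    xml_bytes = ET.tostring(root)   # extra serialization step
    doc = ET.fromstring(xml_bytes)  # extra parse step in the view layer
    items = "".join(f"<li>{hit.text}</li>" for hit in doc.findall("hit"))
    return f"<html><h1>{doc.get('title')}</h1><ul>{items}</ul></html>"

if __name__ == "__main__":
    hits = ["page one", "page two"]
    # Both pipelines produce the same HTML; the second just does more work.
    assert render_direct("nutch", hits) == render_via_xml("nutch", hits)
```

The output is identical either way; the debate is whether the intermediate document's build/serialize/transform overhead is acceptable on the latency-sensitive front end.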

Dublin Core may be good for the semantic web, but not for content storage.

It is not used as content storage, but just as an intermediate step: the output of the XSL transformation will then be indexed using the standard
Nutch APIs.
(Notice that this XML file schema maps perfectly to the Parse and ParseData
objects.)


In general the goal must be to minimize memory usage and improve
performance; such a parser would increase memory usage and definitely
slow down parsing.

Not to improve flexibility, extensibility, and features?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
