Jérôme,
A mail archive is an amazing source of information, isn't it?! :-)
To answer your question, just ask yourself how many pages per second
you plan to fetch and parse, and how many queries per second a Lucene
index is able to handle - and that you can deliver in the UI.
Here I have something like 200++ pages per second against a maximum of
20 queries per second.
http://wiki.apache.org/nutch/HardwareRequirements
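To make that kind of capacity planning concrete, here is a back-of-envelope sketch. All numbers are hypothetical placeholders, not measurements from any real Nutch deployment:

```python
# Back-of-envelope capacity check: how many search servers would be
# needed to serve a target query load, given how many queries per
# second a single Lucene index box can sustain?
# All figures below are assumptions for illustration only.

fetch_rate_pps = 200     # pages fetched and parsed per second (assumed)
qps_per_box = 20         # queries/sec one search server handles (assumed)
target_qps = 50          # peak query load the UI must serve (assumed)

# Ceiling division: round up, since a fraction of a server is not enough.
search_boxes = -(-target_qps // qps_per_box)

print(f"search servers needed for {target_qps} qps: {search_boxes}")
```

The point of the exercise is that fetch/parse throughput and query throughput scale very differently, so the two sides have to be sized independently.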
Speed improvements in the UI can be achieved by caching the components
you use to assemble the UI. "There are some ways to improve speed."
But seriously, I don't think there will be any pages that contain
'cacheable' items until parsing.
Over the last years there is one thing I have noticed that matters in
a search engine - minimalism.
There is no usage in Nutch of a logging library, no RMI, and no
metadata in the web db. Why?
Minimalism.
Minimalism == speed, speed == scalability, scalability == serious
enterprise search engine projects.
I don't think it would be a good move to slow down HTML parsing (the
most used parser) to make RSS parser writing easier for developers.
BTW, we already have an HTML and a feed parser that work, as far as I
know.
I guess 90% of Nutch users use the HTML parser but only 10% the
feed parser (since blogs are mostly HTML as well).
From my perspective we have much more general things to solve in
Nutch (manageability, monitoring, NDFS block-based task routing, more
dynamic search servers) than improving things we already have.
Anyway as you may know we have a plugin system and one goal of the
plugin system is to give developers the freedom to develop custom
plugins. :-)
Cheers,
Stefan
B-)
P.S. Do you think it makes sense to run another public Nutch mailing
list, since 'THE nutch [...]' mailing list is nutch-
[EMAIL PROTECTED]? 'Isn't it?'
http://www.mail-archive.com/[email protected]/msg01513.html
On 24.11.2005 at 19:28, Jérôme Charron wrote:
Hi Stefan,
And thanks for taking the time to read the doc and give us your
feedback.
-1!
XSL is terribly slow!
XML will blow up memory and storage usage.
But there is still something I don't understand...
Regarding a previous discussion we had about using the OpenSearch
API to replace Servlet => HTML with Servlet => XML => HTML (using
XSL), here is a copy of one of my comments:
In my opinion, it is the "dream" front-end architecture. But more
pragmatically, I'm not sure it's a good idea. XSL transformation is a
rather slow process!! And the Nutch front-end must be very responsive.
and then your response, and Doug's response too:
Stefan:
We have already done experiments using XSLT.
There are some ways to improve speed; however, it is 20++ % slower
than JSP.
Doug:
I don't think this would make a significant impact on overall Nutch
search
performance.
(the complete thread is available at
http://www.mail-archive.com/[email protected]/
msg03811.html
)
I'm a little bit confused... why must the use of XSL be considered
too time- and memory-expensive in the back-end process,
but not in the front-end?
Dublin Core may be good for the semantic web, but not for content
storage.
It is not used as content storage, but just as an intermediate step:
the output of the XSL transformation will then be indexed using
standard Nutch APIs.
(Notice that this XML file schema maps perfectly to the Parse and
ParseData objects.)
In general the goal must be to minimize memory usage and improve
performance; such a parser would increase memory usage and definitely
slow down parsing.
Not improve flexibility, extensibility and features?
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers