Jérôme,
A mail archive is an amazing source of information, isn't it?! :-)
To answer your question, just ask yourself how many pages per second
you plan to fetch and parse, and how many queries per second a Lucene
index is able to handle - and that you can deliver in the UI.
Here I have something like 200++ pages per second against a maximum of
20 queries per second.
http://wiki.apache.org/nutch/HardwareRequirements
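To make that kind of capacity planning concrete, here is a back-of-envelope sketch. All numbers are hypothetical placeholders, not measurements from any real Nutch deployment:

```python
# Back-of-envelope capacity check: how many search servers would be
# needed to serve a target query load, given how many queries per
# second a single Lucene index box can sustain?
# All figures below are assumptions for illustration only.

fetch_rate_pps = 200     # pages fetched and parsed per second (assumed)
qps_per_box = 20         # queries/sec one search server handles (assumed)
target_qps = 50          # peak query load the UI must serve (assumed)

# Ceiling division: round up, since a fraction of a server is not enough.
search_boxes = -(-target_qps // qps_per_box)

print(f"search servers needed for {target_qps} qps: {search_boxes}")
```

The point of the exercise is that fetch/parse throughput and query throughput scale very differently, so the two sides have to be sized independently.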
Speed improvements in the UI can be achieved by caching the components
you use to assemble the UI. "There are some ways to improve speed."
But seriously, I don't think there will be any pages that contain
'cacheable' items until parsing.
Over the last years there is one thing I have noticed that matters in
a search engine - minimalism.
There is no usage in Nutch of a logging library, no RMI, and no
metadata in the web db. Why?
Minimalism.
Minimalism == speed, speed == scalability, scalability == serious
enterprise search engine projects.
I don't think it would be a good move to slow down HTML parsing (the
most used parser) to make RSS parser writing easier for developers.
BTW, we already have an HTML and a feed parser that work, as far as I
know.
I guess 90% of Nutch users use the HTML parser but only 10% the
feed parser (since blogs are mostly HTML as well).
From my perspective we have much more general things to solve in
Nutch (manageability, monitoring, NDFS block-based task routing, more
dynamic search servers) than improving things we already have.
Anyway as you may know we have a plugin system and one goal of the
plugin system is to give developers the freedom to develop custom
plugins. :-)
Cheers,
Stefan
B-)
P.S. Do you think it makes sense to run another public Nutch mailing
list, since 'THE nutch [...]' mailing list is nutch-
[EMAIL PROTECTED]? 'Isn't it?'
http://www.mail-archive.com/[email protected]/msg01513.html
On 24.11.2005 at 19:28, Jérôme Charron wrote:
Hi Stefan,
And thanks for taking the time to read the doc and give us your
feedback.
-1!
XSL is terribly slow!
XML will blow up memory and storage usage.
But there is still something I don't understand...
Regarding a previous discussion we had about using the OpenSearch
API to replace Servlet => HTML with Servlet => XML => HTML (using
XSL), here is a copy of one of my comments:
In my opinion, it is the "dream" front-end architecture. But more
pragmatically, I'm not sure it's a good idea. XSL transformation is a
rather slow process!! And the Nutch front-end must be very responsive.
and then your response, and Doug's response too:
Stefan:
We have already done experiments using XSLT.
There are some ways to improve speed; however, it is 20++ % slower
than JSP.
Doug:
I don't think this would make a significant impact on overall Nutch
search
performance.
(the complete thread is available at
http://www.mail-archive.com/[email protected]/
msg03811.html
)
I'm a little bit confused... why must the use of XSL be considered
too time- and memory-expensive in the back-end process,
but not in the front-end?
Dublin Core may be good for the semantic web, but not for content
storage.
It is not used as content storage, but just as an intermediate step:
the output of the XSL transformation will then be indexed using
standard Nutch APIs.
(Notice that this XML file schema maps perfectly to the Parse and
ParseData objects.)
In general the goal must be to minimize memory usage and improve
performance; such a parser would increase memory usage and definitely
slow down parsing.
Not improve flexibility, extensibility and features?
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers