Hi Stefan, and Jerome,

> A mail archive is an amazing source of information, isn't it?! :-)
> To answer your question, just ask yourself how many pages per second
> you plan to fetch and parse, and how many queries per second a Lucene
> index is able to handle - and you can deliver in the ui.
> I have here something like 200++ compared to a maximum of 20 queries
> per second.
> http://wiki.apache.org/nutch/HardwareRequirements

I'm not sure that our proposal really affects the ui at all. Parsing occurs
only during a fetch, which creates the index for the ui, no? So why mention
the number of queries per second that the ui can handle?

> 
> Speed improvement in the ui can be done by caching the components you
> use to assemble the ui. "There are some ways to improve speed"
> But seriously, I don't think there will be any pages that contain
> 'cacheable' items until parsing.
> Over the last years there is one thing I have noticed that matters in
> a search engine - minimalism.
> There is no usage in nutch of a logging library,

Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-)

> no RMI and no meta
> data in the web db. Why?
> Minimalism.
> Minimalism == speed, speed == scalability, scalability == serious
> enterprise search engine projects.
> 
> I don't think it would be a good move to slow down html parsing (the
> most used parser) to make writing rss parsers easier for developers.

This proposal isn't meant just for RSS; that framing seriously constrains the
scope. The proposal is meant to make writing * XML * parsers easier. Note the
"XML". RSS is only a small subset of XML as a whole, and there is currently
no default support for generic XML documents in Nutch.
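
To make this concrete, here is a rough, hypothetical sketch (not code from
our proposal) of the kind of thing such a generic plugin could do: apply a
user-supplied XSLT stylesheet to a fetched XML document and hand the
transformed output to the normal indexing path. The class name, arguments,
and stylesheet are illustrative; only the standard JAXP calls are assumed.

  // Hypothetical sketch: map arbitrary XML to indexable text via a
  // user-defined XSLT stylesheet, using only standard JAXP.
  import java.io.File;
  import java.io.StringWriter;
  import javax.xml.transform.Transformer;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.stream.StreamResult;
  import javax.xml.transform.stream.StreamSource;

  public class XmlToIndexableText {

    public static String transform(File xmlDoc, File userStylesheet)
        throws Exception {
      // The user-defined stylesheet encodes the mapping from the document's
      // schema to the fields we want to index.
      TransformerFactory factory = TransformerFactory.newInstance();
      Transformer transformer =
          factory.newTransformer(new StreamSource(userStylesheet));

      // Apply it to the fetched content; a parse plugin could then pass the
      // result on through the usual parse/index path.
      StringWriter out = new StringWriter();
      transformer.transform(new StreamSource(xmlDoc), new StreamResult(out));
      return out.toString();
    }
  }

The point to note is that this transformation would happen once per document
at parse time, during the fetch, not per query at search time.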


> BTW, we already have an html and a feed parser that work, as far as I
> know. I guess 90 % of the nutch users use the html parser but only
> 10 % the feed-parser (since blogs are mostly html as well).

This may or may not be true; however, I wouldn't be surprised if it were,
because it is representative of the division of content on the web -- HTML is
definitely orders of magnitude more pervasive than RSS.

> 
>  From my perspective we have much more general things to solve in
> nutch (manageability, monitoring, ndfs block based task-routing, more
> dynamic search servers) than improving things we already have.

I would tend to agree with Jerome on this one -- these seem to be the items
on your agenda: a representative set indeed, but by no means an exhaustive
set of what's needed to improve and benefit Nutch. One of the motivations
behind our proposal was several emails posted to the Nutch list by users
interested in crawling blogs and RSS:

http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/2369417.html

One of my replies to this thread was a message on October 19th, 2005, which
really identified the main problem:

http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/2369576.html

Nutch lacks a general XML parser that would allow it to deal with arbitrary
XML content based on user-defined schemas and DTDs. Our proposal would be an
initial step towards a solution to that overall problem. At least, that's
part of its intention.
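
As a hedged illustration of the "user-defined schemas" part, the sketch
below validates a fetched document against a schema the user supplies,
before any field extraction happens. Everything here is hypothetical except
the JAXP 1.3 validation API it calls.

  // Hypothetical sketch: check a fetched XML document against a
  // user-defined W3C XML Schema before extracting anything from it.
  import java.io.File;
  import javax.xml.XMLConstants;
  import javax.xml.transform.stream.StreamSource;
  import javax.xml.validation.Schema;
  import javax.xml.validation.SchemaFactory;
  import javax.xml.validation.Validator;

  public class UserSchemaCheck {

    public static void validate(File xmlDoc, File userSchema)
        throws Exception {
      // Compile the user-supplied schema describing this class of documents.
      SchemaFactory sf =
          SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
      Schema schema = sf.newSchema(userSchema);

      // With no custom error handler, validate() throws on any document
      // that does not conform, so a parse plugin could skip or flag it.
      Validator validator = schema.newValidator();
      validator.validate(new StreamSource(xmlDoc));
    }
  }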


> Anyway as you may know we have a plugin system and one goal of the
> plugin system is to give developers the freedom to develop custom
> plugins. :-)

Indeed. And our goal is to help developers in their endeavors by providing a
starting point and a generic solution for XML-based parsing plugins :-)

Cheers,
  Chris


> 
> Cheers,
> Stefan
> B-)
> 
> P.S. Do you think it makes sense to run another public nutch mailing
> list, since 'THE nutch [...]' (mailing list is
> nutch-[EMAIL PROTECTED]), 'Isn't it?'
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01513.html
> 
> 
> 
> On 24.11.2005, at 19:28, Jérôme Charron wrote:
> 
> > Hi Stefan,
> >
> > And thanks for taking time to read the doc and giving us your
> > feedback.
> >
> > -1!
> >> Xsl is terribly slow!
> >> Xml will blow up memory and storage usage.
> >
> > But there is still something I don't understand...
> > Regarding a previous discussion we had about the use of the OpenSearch
> > API to replace Servlet => HTML with Servlet => XML => HTML (using xsl),
> > here is a copy of one of my comments:
> >
> > In my opinion, it is the "dream" front-end architecture. But more
> > pragmatically, I'm not sure it's a good idea. XSL transformation is a
> > rather slow process!! And the Nutch front-end must be very responsive.
> >
> > and then your response, and Doug's response too:
> > Stefan:
> > We have already done experiments using XSLT.
> > There are some ways to improve speed; however, it is 20++ % slower
> > than jsp.
> > Doug:
> > I don't think this would make a significant impact on overall Nutch
> > search performance.
> > (the complete thread is available at
> > http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg03811.html)
> >
> > I'm a little bit confused... why must the use of xsl be considered
> > too time and memory expensive in the back-end process, but not in
> > the front-end?
> >
> > Dublin core may be good for the semantic web, but not for content
> > storage.
> >
> > It is not used as content storage, but just as an intermediate step:
> > the output of the xsl transformation, which will then be indexed
> > using standard nutch APIs.
> > (notice that this xml file schema maps perfectly to the Parse and
> > ParseData objects)
> >
> >
> >> In general the goal must be to minimize memory usage and improve
> >> performance; such a parser would increase memory usage and
> >> definitely slow down parsing.
> >
> > Not improving the flexibility, extensibility and features?
> >
> > Jérôme
> >
> > --
> > http://motrech.free.fr/
> > http://www.frutch.org/
