Hi Stefan,

> -1!
> Xsl is terrible slow!

You have to consider what the XSL will be used for. Our proposal suggests
XSL as a means of intermediate transformation of markup content on the
"backend", as Jérôme suggested in his reply. This means that whenever markup
content is encountered, specifically XML-based content, XSL will be used to
create an intermediate "parse-out" XML file containing the fields to index.
Given how small a percentage of the content out there is XML-based markup
(of course excluding "html"), compared to regular content, I don't think
this would significantly degrade performance.

> Xml will blow up memory and storage usage.

Possibly, but I would expect us to do it in a clever fashion. For instance,
the parse-out XML files would most likely be small (on the order of
kilobytes) and could be deleted after indexing if space is a concern. It
could be a parameterized option.
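As a sketch of what that parameterized option might look like, it could
follow the XML configuration format Nutch already uses in nutch-site.xml.
The property name below is made up for illustration, not an actual Nutch
property:

    <!-- Hypothetical property, for illustration only -->
    <property>
      <name>parser.xml.keep.parseout</name>
      <value>false</value>
      <description>If false, delete the intermediate parse-out files
      once their fields have been indexed.</description>
    </property>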

> Dublin core may is good for semantic web, but not for a content storage.

I completely disagree with that, and I think many people would as well.
Dublin Core is a "standard" metadata model for electronic resources. It is
by no means the entire spectrum of metadata that could be stored for
electronic content. However, rather than inventing our own "author" field,
or "content creator", or "document creator", or whatever you want to call
it, I think it would be nice to provide the DC metadata because it is well
known and provides interoperability with other content storage systems.
Check out DSpace from MIT. Check out ISO/IEC 11179 metadata registry
systems. Check out the ISO standard OAIS reference model for archiving
systems. Each of these systems has recognized that standard metadata is an
important concern in any content management system.
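For a feel of what this buys us, here is a small illustrative DC record.
The element values and the <metadata> wrapper are made up, but the dc:
elements come from the standard http://purl.org/dc/elements/1.1/ namespace,
so any DC-aware system can find the creator or title without knowing our
internal field names:

    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Markup Language Parser Proposal</dc:title>
      <dc:creator>Chris Mattmann</dc:creator>
      <dc:date>2005-11-24</dc:date>
      <dc:format>text/xml</dc:format>
    </metadata>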

> In general the goal must be to minimalize memory usage and improve
> performance such a parser would increase memory usage and definitely
> slow down parsing.

I don't think it would slow down parsing significantly; as I mentioned
above, XML markup content represents a small portion of the content out
there.

> The magic world is minimalism.
> So I vote against this suggestion!
> Stefan

In general, this proposal represents a step forward in being able to parse
generic XML content in Nutch, which is a very challenging problem. Thanks
for your suggestions; however, I think that our proposal would help Nutch
move forward in being able to handle generic forms of XML markup content.


Cheers,
   Chris Mattmann

> On 24.11.2005 at 00:01, Jérôme Charron wrote:
> 
> > Hi,
> >
> > We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and
> > me) just
> > add a new proposal on the nutch Wiki:
> > http://wiki.apache.org/nutch/MarkupLanguageParserProposal
> >
> > Here is the Summary of Issue:
> > "Currently, Nutch provides some specific markup language parsing
> > plugins:
> > one for handling HTML, another one for RSS, but no generic XML parsing
> > plugin. This is extremely cumbersome as adding support for a new
> > markup
> > language implies that you have to develop the whole XML parsing
> > code from
> > scratch. This methodology causes: (1) code duplication, with little
> > or no
> > reuse of common pieces of XML parsing code, and (2) dependency library
> > duplication, where many XML parsing plugins may rely on similar xml
> > parsing
> > libraries, such as jaxen, or jdom, or dom4j, etc., but each parsing
> > plugin
> > keeps its own local copy of these libraries. It is also very
> > difficult to
> > identify precisely the type of XML content encountered during a
> > parse. That
> > difficult issue is outside the scope of this proposal, and will be
> > identified in a future proposal."
> >
> > Thanks for your feedback, comments, suggestions (and votes).
> >
> > Regards
> >
> > Jérôme
> >
> > --
> > http://motrech.free.fr/
> > http://www.frutch.org/
