Hi,

We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and I) have
just added a new proposal to the Nutch wiki:
http://wiki.apache.org/nutch/MarkupLanguageParserProposal

Here is its "Summary of Issue":
"Currently, Nutch provides some specific markup language parsing plugins:
one for handling HTML, another one for RSS, but no generic XML parsing
plugin. This is extremely cumbersome as adding support for a new markup
language implies that you have to develop the whole XML parsing code from
scratch. This methodology causes: (1) code duplication, with little or no
reuse of common pieces of XML parsing code, and (2) dependency library
duplication, where many XML parsing plugins may rely on similar xml parsing
libraries, such as jaxen, or jdom, or dom4j, etc., but each parsing plugin
keeps its own local copy of these libraries. It is also very difficult to
identify precisely the type of XML content encountered during a parse. That
difficult issue is outside the scope of this proposal, and will be
identified in a future proposal."

Thanks in advance for your feedback, comments, and suggestions (and votes).

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
