Hi,

We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and I) have just added a new proposal to the Nutch wiki:
http://wiki.apache.org/nutch/MarkupLanguageParserProposal

Here is the "Summary of Issue" section:

"Currently, Nutch provides some specific markup language parsing plugins: one for handling HTML, another one for RSS, but no generic XML parsing plugin. This is extremely cumbersome, as adding support for a new markup language implies that you have to develop the whole XML parsing code from scratch. This methodology causes: (1) code duplication, with little or no reuse of common pieces of XML parsing code, and (2) dependency library duplication, where many XML parsing plugins may rely on similar XML parsing libraries, such as jaxen, jdom, or dom4j, but each parsing plugin keeps its own local copy of these libraries. It is also very difficult to identify precisely the type of XML content encountered during a parse. That difficult issue is outside the scope of this proposal, and will be identified in a future proposal."

Thanks in advance for your feedback, comments, suggestions (and votes).

Regards,
Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
