Stefan Groschupf wrote: [...]
Gentlemen, please let's keep a civilized tone to this exchange, or take it off the list.
I applaud this effort and can certainly sympathize with its goals; just the other day I struggled with parsing an XML feed into Nutch segments. A generic platform that handles all kinds of XML input, with a way to express mappings from any XML schema to the standard metadata used in Nutch, would be very welcome.
You don't have to use XSL to accomplish this - an XPath processor would do fine in many cases. Even if you use XSL, as long as you avoid certain costly constructs you can keep decent performance, with the benefit of the flexibility and standards-compliance that come with XSL (people already know how to use it).
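To illustrate the XPath-only route, here is a minimal sketch using the standard javax.xml.xpath API. The class name XPathMetadataSketch, the extract() method, the field names and the sample feed are all made up for illustration; none of this is existing Nutch code, and a real plugin would hand the result to the Nutch parse classes instead of returning a Map.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: applies a table of
// "metadata field -> XPath expression" mappings to one XML document.
class XPathMetadataSketch {

    static Map<String, String> extract(String xml, Map<String, String> fieldToXPath)
            throws Exception {
        // Parse once, then evaluate every expression against the same DOM.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fieldToXPath.entrySet()) {
            // evaluate(String, Object) returns the string value of the first
            // matching node, or an empty string when nothing matches.
            String value = xpath.evaluate(e.getValue(), doc);
            metadata.put(e.getKey(), value.trim());
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input; any XML schema works as long as the
        // expressions in the mapping table match it.
        String feed = "<item><title>Example</title><link>http://example.org/a</link></item>";
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("title", "/item/title");
        mapping.put("url", "/item/link");
        System.out.println(extract(feed, mapping));
    }
}
```

The point is that the schema-specific part is just the mapping table, which could live in a config file, so supporting a new XML format would not need new Java code.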
At the same time I see little benefit in creating an intermediate XML - as soon as the data extraction is completed, the same information can be passed perfectly well using the Nutch internal classes (ParseImpl and friends) - unless you want to replace the original Content in segments with this intermediate XML.
I also don't think this solution would be suitable for parse-html, where top-notch performance is crucial and where by default we have to deal with invalid or even non-well-formed documents. There, fixing, parsing and extracting in one step, as we do today, seems to be the most efficient way to go, and I very much doubt you will be able to get the same performance with your approach.
So, if you add this as a generic parse-xml framework, to be used where it makes sense in terms of flexibility and performance - I think this would change very little for those who are not interested in XML content, but it would be a big help for those who have to deal with it.
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
