Hello,
I agree with Andrzej. I do not see it as a solution for parse-html, but a generic XML plugin may well be useful to some people (even if not to me).
Regards
Piotr


Andrzej Bialecki wrote:
Stefan Groschupf wrote:

[...]

Gentlemen, please let's keep a civilized tone to this exchange, or take it off the list.

I applaud this effort, and I can certainly sympathize with its goals - just the other day I struggled with parsing an XML feed into Nutch segments. It would be very welcome to have a generic platform to handle all kinds of XML input, and a way to express mappings from any XML schema to the standard metadata used in Nutch.

You don't have to use XSL to accomplish this - an XPath processor would do fine in many cases. Even if you use XSL, as long as you avoid certain costly constructs you can keep decent performance, with the benefit of the flexibility and standards-compliance that come with XSL (people already know how to use it).
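
To illustrate the XPath-only route, here is a minimal sketch using the standard javax.xml.xpath API from the JDK. The class name XPathMetadataSketch, the extract method, and the idea of a field-to-XPath mapping table are assumptions made for this example; they are not part of Nutch, and the sketch ignores XML namespaces.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class: maps metadata field names to
// XPath expressions and evaluates them against one XML document.
final class XPathMetadataSketch {

    // mapping: metadata field name -> XPath expression (both chosen by the caller).
    // Returns field name -> extracted string value (empty string if nothing matches).
    static Map<String, String> extract(String xml, Map<String, String> mapping) {
        try {
            // Parse the input into a DOM tree; this requires well-formed XML.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> out = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : mapping.entrySet()) {
                // evaluate(String, Object) returns the string value of the first match.
                out.put(e.getKey(), xpath.evaluate(e.getValue(), doc).trim());
            }
            return out;
        } catch (Exception ex) {
            // Malformed XML or an invalid XPath expression ends up here.
            throw new IllegalArgumentException("XPath extraction failed", ex);
        }
    }
}
```

A caller would fill the mapping with entries such as "title" -> "/item/title" and pass the resulting values on to whatever builds the parse result. The mapping table plays the role that an XSL stylesheet would otherwise play.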

At the same time, I see little benefit in creating an intermediate XML - as soon as the data extraction is completed, the same information can be passed perfectly well using the Nutch internal classes (ParseImpl and friends) - unless you want to replace the original Content in segments with this intermediate XML.

I also don't think this solution would be suitable for parse-html, where top-notch performance is crucial and where by default we have to deal with documents that are invalid or not even well-formed - and fixing, parsing and extracting in one step, as we do today, seems to be the most efficient way to go. So I very much doubt you will be able to get the same performance with your approach.

So, if you add this as a generic parse-xml framework, to be used where it makes sense in terms of flexibility and performance - I think this would change very little for those who are not interested in XML content, but it would be a big help for those who have to deal with it.




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
