Hello,
I do agree with Andrzej. I do not see it as a solution for
parse-html. But a generic XML plugin may be of some use to some
people (even if not to me).
Regards
Piotr
Andrzej Bialecki wrote:
Stefan Groschupf wrote:
[...]
Gentlemen, please let's keep a civilized tone to this exchange, or take
it off the list.
I applaud this effort, I can certainly sympathize with its goals - just
the other day I struggled with parsing an XML feed into Nutch segments.
It would be very welcome to have a generic platform to handle all kinds
of XML input and a way to express mappings from any XML schema to the
standard metadata used in Nutch.
You don't have to use XSL to accomplish this - an XPath processor would
do fine in many cases. Even if you use XSL, and you avoid certain costly
constructs, you can keep decent performance, with the benefit of
flexibility and standards-compliance that comes with XSL (people already
know how to use it).
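
To illustrate the XPath-only route, here is a minimal sketch using the
standard javax.xml.xpath API. The class name, the method name and the
field-to-XPath mapping table are hypothetical, made up for this example;
none of this is existing Nutch code, and handing the resulting map to
the Nutch parse classes is left out.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: maps metadata field names to XPath expressions
// and evaluates them against one XML document.
public class XPathMetadataSketch {

    // mappings: metadata field name -> XPath expression (assumed format)
    static Map<String, String> extract(String xml, Map<String, String> mappings)
            throws Exception {
        // Parse the input once, then run every expression on the same DOM.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> m : mappings.entrySet()) {
            // evaluate() returns the string value of the first matching
            // node, or an empty string when nothing matches.
            metadata.put(m.getKey(), xpath.evaluate(m.getValue(), doc).trim());
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<item><title>Hello</title>"
                + "<link>http://example.org/a</link></item>";
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("title", "/item/title");
        mappings.put("url", "/item/link");
        System.out.println(extract(xml, mappings));
    }
}
```

A per-schema mapping table like this could live in a configuration
file, so supporting a new feed format would not require new code.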
At the same time I see little benefit in creating an intermediate XML -
as soon as the data extraction is completed the same information can be
passed perfectly well using the Nutch internal classes (ParseImpl and
friends) - unless you want to replace the original Content in segments
with this intermediate XML.
I also don't think this solution would be suitable for parse-html, where
top-notch performance is crucial and where by default we have to
deal with non-valid or even non well-formed documents - and fixing,
parsing and extracting in one step, as we do it today, seems to be the
most efficient way to go. So, I very much doubt you will be able to get
the same performance if you use your approach.
So, if you add this as a generic parse-xml framework, to be used where
it makes sense in terms of flexibility and performance - I think this
would change very little for those who are not interested in XML
content, but it would be a big help for those who have to deal with it.