We would like to parse the content of a FOAF
(http://www.foaf-project.org) file (an RDF file). We get the URI of the
file from the HTML, in a <link> element inside <head>, defined as:

<link rel="meta" type="application/rdf+xml" title="FOAF"
href="URI/to/foaf.rdf" />

Does Nutch automatically schedule the href attribute value for
fetching? If not, what could we do to fetch and parse it? Here is our
tentative solution for now:

a) Create an HTMLParseFilter plugin that parses the HTML document to
find any <link> element and adds its href attribute value to the list
of documents to be fetched.

b) Create a parse plugin that is associated with the
"application/rdf+xml" content type.
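
To make step (a) concrete, here is a rough, stand-alone sketch of the
extraction logic only. It deliberately does not use any Nutch API: the
class name FoafLinkExtractor, the helper attr() and the example.org
URLs are all made up for illustration, and the regex-based tag matching
is an assumption that would probably be replaced by whatever DOM the
plugin receives. It also filters on the type attribute only, which is
our guess at a reasonable rule.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects RDF <link> targets.
public class FoafLinkExtractor {
    // Matches a whole <link ...> tag, case-insensitively (may span lines).
    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the quoted value of the named attribute in a tag, or null.
    static String attr(String tag, String name) {
        Pattern p = Pattern.compile(
            "\\b" + name + "\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(tag);
        return m.find() ? m.group(2) : null;
    }

    // Returns absolute URLs of every <link> whose type is application/rdf+xml.
    static List<String> extract(String html, String baseUrl) {
        List<String> urls = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher m = LINK_TAG.matcher(html);
        while (m.find()) {
            String tag = m.group();
            String type = attr(tag, "type");
            String href = attr(tag, "href");
            if (href != null && "application/rdf+xml".equalsIgnoreCase(type)) {
                // Resolve relative hrefs against the page URL.
                urls.add(base.resolve(href.trim()).toString());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up page and base URL, for illustration only.
        String html = "<html><head>"
            + "<link rel=\"meta\" type=\"application/rdf+xml\" title=\"FOAF\"\n"
            + " href=\"foaf.rdf\" />"
            + "</head><body></body></html>";
        System.out.println(extract(html, "http://example.org/people/index.html"));
    }
}
```

The URLs returned by extract() are what the plugin would then have to
hand over to the fetch list; how to do that inside Nutch is exactly the
part we are unsure about.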

What do you guys think?


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
