Stefan - thanks for the reply. I'm still digesting Nutch and how to work with it at a basic level, but it does make sense to allow metadata to tag along with fetches. I certainly don't know enough yet to say whether your patch fits into the long-term vision of Nutch.

I've started writing a custom RDF parser plugin that will take the URL and simply add it to Kowari (letting Kowari actually parse it and ingest it). But I'm feeling like this might not be the best approach.
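Concretely, here's the shape of what I've got so far - a minimal sketch against the Parser interface as I understand it from the bundled plugins (so forgive any drift from the real signatures); KowariClient is a stand-in for whatever Kowari ingestion API I end up using, not a real class:

import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseException;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

public class RdfParser implements Parser {

  public Parse getParse(Content content) throws ParseException {
    String url = content.getUrl();

    // Hand the raw bytes straight to Kowari and let it do the real
    // RDF parsing and ingestion. KowariClient is hypothetical.
    KowariClient.ingest(url, content.getContent());

    // Return a minimal Parse so the rest of the pipeline is satisfied;
    // the interesting work happened on the Kowari side.
    ParseData data = new ParseData("RDF: " + url, new Outlink[0],
                                   content.getMetadata());
    return new ParseImpl("", data);
  }
}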

At what stage would it make the most sense to ingest RDF into an external system? Is parsing the most logical one?

Further on this topic, I'm curious about indexing multiple "documents" per .rdf file fetched - for instance, one document per RDF "resource". Is this currently possible with a plugin? If not, what would it take to do something like this? Maybe this approach doesn't even make sense in Nutch terms - I'm just exploring my architectural options.
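To make that concrete, the splitting itself seems straightforward outside of Nutch - e.g. with Jena (just a sketch to show the shape of the idea, independent of any Nutch API):

import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import com.hp.hpl.jena.rdf.model.*;

public class RdfSplitter {

  // Split one fetched RDF file into one text blob per subject resource;
  // each blob could then become its own indexed "document".
  public static Map splitBySubject(InputStream in, String baseUri) {
    Model model = ModelFactory.createDefaultModel();
    model.read(in, baseUri);              // let Jena parse the RDF/XML

    Map docs = new HashMap();             // subject URI -> flattened text
    for (ResIterator it = model.listSubjects(); it.hasNext();) {
      Resource subject = it.nextResource();
      StringBuffer text = new StringBuffer();
      for (StmtIterator st = subject.listProperties(); st.hasNext();) {
        text.append(st.nextStatement().getObject().toString()).append(' ');
      }
      docs.put(subject.toString(), text.toString());
    }
    return docs;
  }
}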

Thanks,
    Erik



On Jul 19, 2005, at 9:19 AM, Stefan Groschupf wrote:

Hi Erik,

As far as I know, the source page a URL comes from is not available at crawl, parse, or index time. However, this information can be extracted from the link graph stored in the web db. Please have a look at my patch, which allows you to store custom metadata in the web db; the metadata can be set at any point up to parsing a page, is still available at indexing time, and is merged back into the web db as well.
http://issues.apache.org/jira/browse/NUTCH-59
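In sketch form, the flow the patch enables looks like this - the metadata key is made up and the indexing-filter signature is the 0.7-era API as I recall it, so treat all of it as illustrative rather than the patch's exact API:

// Step 1 (parse time): a parser attaches custom metadata, e.g.
//   parseData.getMetadata().setProperty("rdf.source.url", sourcePageUrl);
// Step 2 (index time): a filter reads it back into the Lucene document.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class RdfSourceFilter implements IndexingFilter {
  public Document filter(Document doc, Parse parse, FetcherOutput fo)
      throws IndexingException {
    String source = parse.getData().getMetadata().getProperty("rdf.source.url");
    if (source != null) {
      doc.add(Field.UnIndexed("rdfSource", source)); // stored, not searchable
    }
    return doc;
  }
}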

I still think the patch makes a lot of sense, and I'm using it successfully in some of my projects; however, it has not been accepted and committed yet.
You can find a comment from Doug about the patch on the mailing list.
He thinks the new map-reduce architecture would make it obsolete - that is true, but I think it will be a long time before people start using map-reduce. I guess most people use Nutch for small, special-interest search engines and mostly integrate it within custom applications. The new map-reduce architecture will make this integration task much more difficult than it already is.

Greetings,
Stefan


Am 19.07.2005 um 14:57 schrieb Erik Hatcher:


Hi,

I'm embarking on an adventure with Nutch to crawl 19th-century digital scholarly archives (like the Rossetti Archive, where I work) for the nines.org system. The goal is to run a normal crawl on a selected set of sites and extract some additional information in the process. Many HTML pages of these archives carry a <link> tag in their <head>, like this page http://www.rossettiarchive.org/docs/1-1882.s241.raw.html:

<link type="application/rdf+xml" title="The Question (for a Design)" href="http://www.rossettiarchive.org/docs/1-1882.s241.raw.rdf">

What I need is a facility to fetch and parse that RDF in a custom way, such that the RDF gets dropped into an RDF engine (currently using Kowari). At the point of processing the RDF data I want to know the URL of the page it came from (the one containing the <link>) such that I can add another RDF statement to the data with provenance information.
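For example, the statement I'd add is something like <rdfUrl> dc:source <pageUrl> - in Jena terms, roughly (a sketch; dc:source is just one candidate predicate, and the vocabulary choice is still open):

import com.hp.hpl.jena.rdf.model.*;

public class Provenance {
  static final String DC = "http://purl.org/dc/elements/1.1/";

  // Record that the RDF at rdfUrl was linked from the HTML page at pageUrl,
  // before handing the model off to Kowari.
  public static void addSource(Model model, String rdfUrl, String pageUrl) {
    Property dcSource = model.createProperty(DC, "source");
    model.add(model.createResource(rdfUrl), dcSource,
              model.createResource(pageUrl));
  }
}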

I see (in the crawl log, using the basic crawl command) that Nutch fetches the RDF. Can a parse plugin know what page the RDF link came from? If not, then how should I craft things to get that info?

At this point I'm a newbie with Nutch, and glad to have this mailing list for advice. I'm quite open to suggestions on how to go about building this custom add-on to Nutch and quite willing to generalize it and contribute it.

Thanks,
    Erik






