Stefan - thanks for the reply. I'm still digesting Nutch and how to work with it at a basic level, but it does make sense to allow metadata to tag along with fetches. I certainly don't know enough yet to say whether your patch fits into the long-term vision of Nutch.

I've started writing a custom RDF parser plugin that will take the URL and simply add it to Kowari (letting Kowari actually parse it and ingest it). But I'm feeling like this might not be the best approach.
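Concretely, here's the shape of what I've got so far - a minimal sketch against the Parser interface as I understand it from the bundled plugins (so forgive any drift from the real signatures); KowariClient is a stand-in for whatever Kowari ingestion API I end up using, not a real class:

import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseException;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

public class RdfParser implements Parser {

  public Parse getParse(Content content) throws ParseException {
    String url = content.getUrl();

    // Hand the raw bytes straight to Kowari and let it do the real
    // RDF parsing and ingestion. KowariClient is hypothetical.
    KowariClient.ingest(url, content.getContent());

    // Return a minimal Parse so the rest of the pipeline is satisfied;
    // the interesting work happened on the Kowari side.
    ParseData data = new ParseData("RDF: " + url, new Outlink[0],
                                   content.getMetadata());
    return new ParseImpl("", data);
  }
}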

At what stage would it make the most sense to ingest RDF into an external system? Is parsing the most logical one?

Further on this topic, I'm curious about indexing multiple "documents" per .rdf file fetched - for instance, one document per RDF "resource". Is this currently possible with a plugin? If not, what would it take to do something like this? Maybe this approach doesn't even make sense in Nutch terms - I'm just exploring my architectural options.
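To make that concrete, the splitting itself seems straightforward outside of Nutch - e.g. with Jena (just a sketch to show the shape of the idea, independent of any Nutch API):

import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import com.hp.hpl.jena.rdf.model.*;

public class RdfSplitter {

  // Split one fetched RDF file into one text blob per subject resource;
  // each blob could then become its own indexed "document".
  public static Map splitBySubject(InputStream in, String baseUri) {
    Model model = ModelFactory.createDefaultModel();
    model.read(in, baseUri);              // let Jena parse the RDF/XML

    Map docs = new HashMap();             // subject URI -> flattened text
    for (ResIterator it = model.listSubjects(); it.hasNext();) {
      Resource subject = it.nextResource();
      StringBuffer text = new StringBuffer();
      for (StmtIterator st = subject.listProperties(); st.hasNext();) {
        text.append(st.nextStatement().getObject().toString()).append(' ');
      }
      docs.put(subject.toString(), text.toString());
    }
    return docs;
  }
}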

Thanks,
    Erik



On Jul 19, 2005, at 9:19 AM, Stefan Groschupf wrote:

Hi Erik,

As far as I know, the source page a URL comes from is not available at crawl, parse, or index time. However, this information can be extracted from the link graph stored in the web db. Please have a look at my patch, which allows you to store custom metadata in the web db; the metadata can be set at any point up to parsing a page, is still available at indexing time, and is merged back into the web db as well.
http://issues.apache.org/jira/browse/NUTCH-59
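In sketch form, the flow the patch enables looks like this - the metadata key is made up and the indexing-filter signature is the 0.7-era API as I recall it, so treat all of it as illustrative rather than the patch's exact API:

// Step 1 (parse time): a parser attaches custom metadata, e.g.
//   parseData.getMetadata().setProperty("rdf.source.url", sourcePageUrl);
// Step 2 (index time): a filter reads it back into the Lucene document.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class RdfSourceFilter implements IndexingFilter {
  public Document filter(Document doc, Parse parse, FetcherOutput fo)
      throws IndexingException {
    String source = parse.getData().getMetadata().getProperty("rdf.source.url");
    if (source != null) {
      doc.add(Field.UnIndexed("rdfSource", source)); // stored, not searchable
    }
    return doc;
  }
}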

I still think the patch makes a lot of sense, and I'm using it successfully in some of my projects; however, it has not been accepted and committed yet.
You can find a comment from Doug about the patch on the mailing list.
He thinks the new map-reduce architecture would make it obsolete - that is true, but I think it will be a long time before people start using map-reduce. I guess most people use Nutch for small, special-interest search engines and mostly integrate it within custom applications. The new map-reduce architecture will make this integration task much more difficult than it already is.

Greetings,
Stefan


Am 19.07.2005 um 14:57 schrieb Erik Hatcher:


Hi,

I'm embarking on an adventure with Nutch to crawl 19th-century digital scholarly archives (like the Rossetti Archive, where I work) for the nines.org system. The goal is to run a normal crawl on a selected set of sites and extract some additional information in the process. Many HTML pages of these archives carry a <link> tag in their <head>, like this page http://www.rossettiarchive.org/docs/1-1882.s241.raw.html:

<link type="application/rdf+xml" title="The Question (for a Design)" href="http://www.rossettiarchive.org/docs/1-1882.s241.raw.rdf">

What I need is a facility to fetch and parse that RDF in a custom way, such that the RDF gets dropped into an RDF engine (currently using Kowari). At the point of processing the RDF data I want to know the URL of the page it came from (the one containing the <link>) such that I can add another RDF statement to the data with provenance information.
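For example, the statement I'd add is something like <rdfUrl> dc:source <pageUrl> - in Jena terms, roughly (a sketch; dc:source is just one candidate predicate, and the vocabulary choice is still open):

import com.hp.hpl.jena.rdf.model.*;

public class Provenance {
  static final String DC = "http://purl.org/dc/elements/1.1/";

  // Record that the RDF at rdfUrl was linked from the HTML page at pageUrl,
  // before handing the model off to Kowari.
  public static void addSource(Model model, String rdfUrl, String pageUrl) {
    Property dcSource = model.createProperty(DC, "source");
    model.add(model.createResource(rdfUrl), dcSource,
              model.createResource(pageUrl));
  }
}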

I see (in the crawl log, using the basic crawl command) that Nutch fetches the RDF. Can a parse plugin know what page the RDF link came from? If not, then how should I craft things to get that info?

At this point I'm a newbie with Nutch, and glad to have this mailing list for advice. I'm quite open to suggestions on how to go about building this custom add-on to Nutch and quite willing to generalize it and contribute it.

Thanks,
    Erik






