Stefan - thanks for the reply. I'm still digesting Nutch and how to
work with it at a basic level, but it does make sense to allow
metadata to tag along with fetches. I certainly don't know enough
yet to say whether your patch fits into the long-term vision of
Nutch.
I've started writing a custom RDF parser plugin that will take the
URL and simply add it to Kowari (letting Kowari actually parse it and
ingest it). But I'm feeling like this might not be the best approach.
At what stage would it make the most sense to ingest RDF into an
external system? Is parsing the most logical stage?
Further on this topic, I'm curious about indexing multiple
"documents" per .rdf file fetched - for instance, one document per
RDF "resource". Is this currently possible with a plugin? If not,
what would it take to do something like this? Maybe this approach
doesn't even make sense within Nutch's model - I'm just exploring
my architectural options.
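To make the idea concrete, here is a rough standalone sketch of the "one document per resource" split - plain DOM, nothing Nutch- or Kowari-specific, and the per-resource "body" is just the element's text content as a placeholder for whatever would actually get indexed:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Sketch: split one RDF/XML file into one "document" per RDF resource,
 * keyed by rdf:about. This only handles the simple striped form where
 * each resource is a top-level rdf:Description.
 */
public class RdfResourceSplitter {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns a map from resource URI to a stand-in per-resource body. */
    public static Map<String, String> split(String rdfXml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder().parse(
                new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> perResource = new LinkedHashMap<String, String>();
            NodeList descs = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descs.getLength(); i++) {
                Element e = (Element) descs.item(i);
                String about = e.getAttributeNS(RDF_NS, "about");
                // Element text stands in for whatever body you'd index per resource.
                perResource.put(about, e.getTextContent().trim());
            }
            return perResource;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String rdf =
            "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:dc='http://purl.org/dc/elements/1.1/'>"
          + "<rdf:Description rdf:about='http://example.org/a'><dc:title>A</dc:title></rdf:Description>"
          + "<rdf:Description rdf:about='http://example.org/b'><dc:title>B</dc:title></rdf:Description>"
          + "</rdf:RDF>";
        Map<String, String> docs = split(rdf);
        System.out.println(docs.size());                      // prints 2
        System.out.println(docs.get("http://example.org/a")); // prints A
    }
}
```

Whether a parse plugin is even allowed to emit multiple index documents per fetched URL is exactly the part I'm unsure about - the split itself is the easy half.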
Thanks,
Erik
On Jul 19, 2005, at 9:19 AM, Stefan Groschupf wrote:
Hi Erik,
As far as I know, you don't have the source page a URL came from
at crawl, parse, or index time.
However, this information can be extracted from the link graph
stored in the web db.
Please have a look at my patch, which allows storing custom
metadata in the web db; the metadata can be set any time up to
parsing a page, is available through indexing time, and is merged
back into the web db as well.
http://issues.apache.org/jira/browse/NUTCH-59
I still think the patch makes a lot of sense, and I'm using it
successfully in some of my projects; however, it has not been
accepted and committed yet.
You can find a comment from Doug on the mailing list about the patch.
He thinks that with the new MapReduce architecture this would be
obsolete - that is true, but I think it will be a long time before
people start using MapReduce.
I guess most people who use Nutch use it for small, special-interest
search engines and mostly integrate it within custom
applications. The new MapReduce architecture will make this
integration task much more difficult than it already is.
Greetings,
Stefan
On Jul 19, 2005, at 2:57 PM, Erik Hatcher wrote:
Hi,
I'm embarking on an adventure with Nutch to crawl 19th century
digital scholarly archives (like the Rossetti Archive, where I
work) for the nines.org system. The goal is to use a normal crawl
on a selected set of sites, and extract some additional
information in the process. Many HTML pages of these archives
will be tagged with a <head> tag like this page http://
www.rossettiarchive.org/docs/1-1882.s241.raw.html:
<link type="application/rdf+xml" title="The Question (for a Design)" href="http://www.rossettiarchive.org/docs/1-1882.s241.raw.rdf">
What I need is a facility to fetch and parse that RDF in a custom
way, such that the RDF gets dropped into an RDF engine (currently
using Kowari). At the point of processing the RDF data I want to
know the URL of the page it came from (the one containing the
<link>) such that I can add another RDF statement to the data with
provenance information.
I can see in the crawl log that Nutch fetches the RDF (using the
basic crawl command). Can a parse plugin know what page the RDF link
came from? If not, then how should I craft things to get that info?
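For what it's worth, the two mechanical pieces - spotting the RDF <link> in a fetched page, and attaching a provenance statement once both URLs are in hand - are simple on their own. A standalone sketch (the regex-based link spotting is naive, and dc:source is just an arbitrary choice of provenance property, not anything Nutch or Kowari prescribes):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch: pull the RDF <link> out of a fetched HTML page and build a
 * provenance triple tying that RDF URL back to the page it came from.
 */
public class RdfLinkProvenance {

    // Naive: assumes href follows type on the same <link> tag, double-quoted.
    static final Pattern RDF_LINK = Pattern.compile(
        "<link[^>]*type=\"application/rdf\\+xml\"[^>]*href=\"([^\"]+)\"",
        Pattern.CASE_INSENSITIVE);

    /** Returns the href of the first RDF/XML <link>, or null if none. */
    public static String findRdfLink(String html) {
        Matcher m = RDF_LINK.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** One N-Triples statement recording where the RDF was linked from.
     *  dc:source is an assumption; any provenance property would do. */
    public static String provenanceTriple(String rdfUrl, String pageUrl) {
        return "<" + rdfUrl + "> <http://purl.org/dc/elements/1.1/source> <"
             + pageUrl + "> .";
    }

    public static void main(String[] args) {
        String html = "<head><link type=\"application/rdf+xml\" "
            + "title=\"The Question (for a Design)\" "
            + "href=\"http://www.rossettiarchive.org/docs/1-1882.s241.raw.rdf\"></head>";
        String rdfUrl = findRdfLink(html);
        System.out.println(provenanceTriple(rdfUrl,
            "http://www.rossettiarchive.org/docs/1-1882.s241.raw.html"));
    }
}
```

The open question is purely where, in Nutch's pipeline, both URLs are available at the same time.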
At this point I'm a newbie with Nutch, and glad to have this
mailing list for advice. I'm quite open to suggestions on how to
go about building this custom add-on to Nutch and quite willing to
generalize it and contribute it.
Thanks,
Erik
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general