Hi Otis,

http://issues.apache.org/jira/browse/NUTCH-59

This patch looks interesting for my Nutch needs.
So please vote for the patch if you like it. :-)

I can't look at the code, but looking at your diff, it looks like this
metadata would be stored somewhere inside Nutch's WebDB, and that one
would have to provide this metadata to Nutch during URL injection....
is this correct?
Yes, the metadata is part of the page object and is stored in the WebDB.
You can add metadata anywhere you handle this page object.
So you can write a custom injector, as you describe, to set metadata, but more interestingly you can also set it at fetch time, or anywhere else you have access to the page object (e.g. segment generation, db update, etc.).


I currently have this little "wrapper method" around a few of Nutch's
tool classes (below). I first generate a plain-text file with all URLs
I want to fetch, then I call the method below, and then I just call
Fetcher.main(...).  If I want to associate some metadata with each URL
to be fetched, where would I insert it into the system?  Would I need
my own injector class with my own addPage method that pulls metadata in
(from some external storage) for each URL it gets, and call
dbWriter.addPageIfNotPresent(page) like WebDBInjector does with DMOZ
data?

Yes.
I personally suggest creating an extension point for the injector. Other people may find that interesting as well, and you could contribute this extension point. :) Write a small plugin that looks up the metadata you plan to add from a MySQL database or similar and adds it to the page object. That's it. You can do very interesting things during the life cycle of the page object. For example, you can generate metadata from the HTML content or from the fetch time (for more intelligent fetching), or hand metadata over from pages to their links, etc.
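
To make the plugin idea a bit more concrete, here is a rough sketch. Nothing in it is real Nutch or NUTCH-59 API: PageSketch, InjectFilter, LookupInjectFilter and InjectorSketch are all hypothetical names made up for illustration, and a plain in-memory map stands in for the external database. It only shows the shape of the idea: the injector hands every new page to the extension, and the extension attaches whatever metadata it finds for that URL.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a page object that carries metadata.
// This is NOT Nutch's real Page class.
final class PageSketch {
    private final String url;
    private final Map<String, String> metadata = new HashMap<>();

    PageSketch(String url) { this.url = url; }

    String getUrl() { return url; }
    void putMetadata(String key, String value) { metadata.put(key, value); }
    String getMetadata(String key) { return metadata.get(key); }
    boolean hasMetadata() { return !metadata.isEmpty(); }
}

// Hypothetical extension point: the injector would call this once
// for each page before the page is written to the db.
interface InjectFilter {
    void beforeInject(PageSketch page);
}

// Example plugin: looks up metadata by URL in an external store.
// The nested map is an assumption standing in for a database table.
final class LookupInjectFilter implements InjectFilter {
    private final Map<String, Map<String, String>> store;

    LookupInjectFilter(Map<String, Map<String, String>> store) {
        this.store = store;
    }

    @Override
    public void beforeInject(PageSketch page) {
        Map<String, String> found = store.get(page.getUrl());
        if (found == null) {
            return; // no metadata known for this URL, page stays as it is
        }
        for (Map.Entry<String, String> e : found.entrySet()) {
            page.putMetadata(e.getKey(), e.getValue());
        }
    }
}

// Hypothetical injector loop: builds one page per URL and runs the filter.
// A real injector would then hand each page to the db writer.
final class InjectorSketch {
    static List<PageSketch> injectAll(List<String> urls, InjectFilter filter) {
        List<PageSketch> pages = new ArrayList<>();
        for (String url : urls) {
            PageSketch page = new PageSketch(url);
            filter.beforeInject(page);
            pages.add(page);
        }
        return pages;
    }
}
```

Pages whose URL is not in the store come out of the loop without any metadata, so URLs you have nothing to say about are simply injected as before.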

Keep in mind that you can use storage types other than the currently existing map for storing metadata. For example, you can implement a StringArray or other data types.



In general I would love to see more interest in this patch and some votes, since I think such metadata can bring a lot of new possible features to Nutch. The very interesting part is that if you do not use metadata, the WebDB is not blown up, and this patch does not slow down WebDB processing speed.

Greetings,
Stefan 
