Hi,
meta data in the web db has been discussed several times on this list.
Currently it is only possible using workarounds such as storing meta
data in an external db or implementing a custom IWebDBReader /
IWebDBWriter. However, these workarounds are very slow.
Since I guess most users run nutch for smaller-scale, special-interest
search engines, I think meta data support is a very interesting
feature.
Therefore I would like to suggest a very simple way to store meta
data in the webdb.
http://issues.apache.org/jira/browse/NUTCH-59
This patch works for our needs; however, it is just a simple
additional field in the page object.
I created a WritableMap and added accessors to the Page object.
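As a rough illustration of the idea only (this is not the code from
the patch), a page object with a serializable meta data map could look
like the sketch below. SketchPage and all of its methods are made-up
names for this example; the real Page class and the WritableMap in the
patch may look quite different.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in: NOT the Nutch Page class and NOT the
// WritableMap from the patch. It only shows "one extra map field
// plus accessors" that is serialized together with the page.
class SketchPage {
    private final String url;
    private final Map<String, String> metaData = new TreeMap<>();

    SketchPage(String url) { this.url = url; }

    String getUrl() { return url; }

    // Accessors for the additional meta data field.
    void setMetaData(String key, String value) { metaData.put(key, value); }
    String getMetaData(String key) { return metaData.get(key); }

    // Writes the url, then the number of entries, then each key/value pair.
    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeInt(metaData.size());
        for (Map.Entry<String, String> entry : metaData.entrySet()) {
            out.writeUTF(entry.getKey());
            out.writeUTF(entry.getValue());
        }
    }

    // Reads a page back in the same order in which write() stored it.
    static SketchPage read(DataInput in) throws IOException {
        SketchPage page = new SketchPage(in.readUTF());
        int entries = in.readInt();
        for (int i = 0; i < entries; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            page.setMetaData(key, value);
        }
        return page;
    }

    // Convenience helpers for in-memory round trips.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static SketchPage fromBytes(byte[] bytes) {
        try {
            return read(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a page without meta data costs only the four-byte
entry count; the serialized size grows only with the entries that are
actually stored.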
To illustrate the usage, I changed the WebDBInjector so that it can
store the DMOZ topic in the webdb.
Furthermore, I changed the index-more plugin to store all page meta
data in the index as well.
So it takes less than 5 minutes to create a query-filter that supports
queries for DMOZ topics like "topic:Top/Science quantum particle".
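To make the example query concrete, here is a minimal sketch of how
such a query string splits into a topic restriction and ordinary
search terms. It deliberately does not use the nutch query-filter
API; TopicQuerySketch and its fields are hypothetical names, and a
real query-filter plugin would work on nutch's parsed query instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of nutch: splits a raw query string
// into an optional "topic:" restriction and the remaining terms.
class TopicQuerySketch {
    static final String FIELD_PREFIX = "topic:";

    final String topic;        // null when the query has no topic clause
    final List<String> terms;  // the ordinary search terms, in order

    private TopicQuerySketch(String topic, List<String> terms) {
        this.topic = topic;
        this.terms = terms;
    }

    static TopicQuerySketch parse(String query) {
        String topic = null;
        List<String> terms = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for an empty query string
            }
            if (token.startsWith(FIELD_PREFIX)
                    && token.length() > FIELD_PREFIX.length()) {
                // Keep the part after "topic:" as the topic restriction.
                topic = token.substring(FIELD_PREFIX.length());
            } else {
                terms.add(token);
            }
        }
        return new TopicQuerySketch(topic, terms);
    }
}
```

The topic value would then be matched against the meta data field
stored in the index, while the remaining terms are searched as usual.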
Sure, the webdb size will blow up, but only if you use meta data.
Depending on the amount of meta data, performance is only slower to
the extent that your hdd needs to read / write more data. We measured
that this is much faster than looking up external data.
We currently maintain our page meta data with a custom tool that
reads a webdb and creates a new one with meta data.
Since this is an additional step in the workflow, I would love to
discuss with developers how to add meta data maintenance, such as
setting and processing meta data, to the nutch workflow.
Adding an extension point during fetching might be a good choice,
since the parsed data and text are available when a page status
update is done. The DbUpdate tool would later bring freshly added meta
data back to the web db.
However, at this point the webdb itself is not available, so
generating meta data based on the link graph would not be possible.
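To give the discussion something concrete, such an extension point
could be sketched roughly as below. This is only an assumption about
its shape: PageMetaDataExtractor, TextLengthExtractor and merge() are
invented names and exist neither in nutch nor in the patch. Note that
the extractor only sees the url and the parsed text, not the webdb.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a fetch-time extension point; none of these
// names exist in nutch or in the patch.
class FetchMetaDataSketch {

    // Called once per fetched page while url and parsed text are at hand.
    // There is no webdb parameter, so link-graph data is out of reach.
    interface PageMetaDataExtractor {
        Map<String, String> extract(String url, String parsedText);
    }

    // Example extension: records the length of the parsed text.
    static class TextLengthExtractor implements PageMetaDataExtractor {
        @Override
        public Map<String, String> extract(String url, String parsedText) {
            Map<String, String> metaData = new HashMap<>();
            metaData.put("textLength", Integer.toString(parsedText.length()));
            return metaData;
        }
    }

    // Stand-in for the later db update step: fresh meta data is merged
    // into what is already stored; fresh values win on equal keys.
    static Map<String, String> merge(Map<String, String> stored,
                                     Map<String, String> fresh) {
        Map<String, String> result = new HashMap<>(stored);
        result.putAll(fresh);
        return result;
    }
}
```

With an interface like this, anything that needs the link graph would
have to run in a separate step that has access to the webdb.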
Any comments?
Stefan