Hi,

meta data in the web db has been discussed several times on this list.
Currently it is only possible through workarounds, such as storing the meta data in an external db or implementing a custom IWebDBReader / IWebDBWriter.
However, these workarounds are very slow.
Since I guess most users run nutch for smaller-scale, special-interest search engines, I think meta data support is a very interesting feature.

Therefore I would like to suggest a very simple way to store meta data in the webdb:
 http://issues.apache.org/jira/browse/NUTCH-59


This patch works for our needs; however, it is just a simple additional field in the Page object.
I created a WritableMap and added accessors for it to the Page object.
To illustrate the usage, I changed the WebDBInjector so that it can store the DMOZ topic in the webdb. Furthermore, I changed the index-more plugin to store all page meta data in the index as well.
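
As a rough illustration of the idea (this is not the code from the patch), a string-to-string map that writes itself to a data stream could look like the sketch below. The class name MetaDataMapSketch and all of its methods are assumptions made for this example; only java.io and java.util are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical stand-in for a writable meta data map; not the NUTCH-59 code.
class MetaDataMapSketch {
    // Sorted map, so the serialized form is deterministic.
    private final Map<String, String> entries = new TreeMap<>();

    public void put(String key, String value) {
        // writeUTF cannot encode null, so reject it early.
        entries.put(Objects.requireNonNull(key), Objects.requireNonNull(value));
    }

    public String get(String key) {
        return entries.get(key);
    }

    public int size() {
        return entries.size();
    }

    // Write the entry count, then each key/value pair as a UTF string.
    public void write(DataOutput out) throws IOException {
        out.writeInt(entries.size());
        for (Map.Entry<String, String> e : entries.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // Replace the current content with the entries read from the stream.
    public void readFields(DataInput in) throws IOException {
        entries.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            entries.put(key, value);
        }
    }

    // Convenience helper: serialize into a byte array.
    public byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience helper: rebuild a map from a byte array.
    public static MetaDataMapSketch fromBytes(byte[] bytes) {
        try {
            MetaDataMapSketch map = new MetaDataMapSketch();
            map.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return map;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A page object would then hold one such map as an extra field and call write / readFields from its own serialization methods, so pages without meta data cost only the four bytes of the empty entry count.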

So it takes you less than 5 minutes to create a query-filter that supports queries for DMOZ topics like "topic:Top/Science quantum particle".
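
To show how such a query breaks down, here is a minimal sketch that separates the topic clause from the plain search terms. It deliberately does not use the nutch query-filter plugin API; the class TopicQuerySketch and its parse method are hypothetical names for this example only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it only splits the query string and is not a nutch plugin.
class TopicQuerySketch {
    private static final String PREFIX = "topic:";

    public final String topic;        // value of the topic clause, or null if absent
    public final List<String> terms;  // remaining plain search terms

    private TopicQuerySketch(String topic, List<String> terms) {
        this.topic = topic;
        this.terms = terms;
    }

    public static TopicQuerySketch parse(String query) {
        String topic = null;
        List<String> terms = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for an empty query
            }
            if (token.startsWith(PREFIX) && token.length() > PREFIX.length()) {
                // If several topic clauses are given, the last one wins.
                topic = token.substring(PREFIX.length());
            } else {
                terms.add(token);
            }
        }
        return new TopicQuerySketch(topic, terms);
    }
}
```

A real query-filter would then restrict the search to documents whose indexed topic field matches the topic value and run the remaining terms as a normal query.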

Of course the webdb size will grow, but only if you actually use meta data.
Depending on the amount of meta data, performance is slower only because your hard disk needs to read / write more data. We measured that this is much faster than looking up external data.

We currently maintain our page meta data with a custom tool that reads a webdb and creates a new one with meta data. Since this is an additional step in the workflow, I would love to discuss with the developers how to add meta data maintenance, such as setting and processing meta data, to the nutch workflow.

Adding an extension point during fetching might be a good choice, since the parsed data and text are available when a page status update is done. The DbUpdate tool would later bring the freshly added meta data back to the web db. However, at this point the webdb itself is not available, so generating meta data based on the link graph would not be possible.

Any comments?


Stefan





