meta data in webdb

Stefan Groschupf Sun, 22 May 2005 09:59:34 -0700

Hi,

Meta data in the webdb has been discussed several times on this list.

Currently it is only possible through workarounds, such as storing the meta data in an external db or implementing a custom IWebDBReader / IWebDBWriter.

However, these workarounds are very slow.

Since I guess most users run nutch for smaller-scale, special-interest search engines, I think meta data support is a very interesting feature.

Therefore I would like to suggest a very simple way to store meta data in the webdb.

 http://issues.apache.org/jira/browse/NUTCH-59

This patch works for our needs; however, it is just a simple additional field in the page object.

I created a WritableMap and added accessors to the Page object.
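
To give a rough idea of what such a map involves, here is a minimal sketch of a string-to-string map that can write itself to a stream and read itself back. It is not the code from the patch: the class name MetaDataMap and all of its methods are assumptions made for this example, and the real WritableMap in NUTCH-59 may look different.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for the WritableMap mentioned above (not from NUTCH-59).
class MetaDataMap {
    private final Map<String, String> entries = new LinkedHashMap<>();

    // Null keys and values are rejected because writeUTF cannot encode them.
    public void put(String key, String value) {
        entries.put(Objects.requireNonNull(key), Objects.requireNonNull(value));
    }

    public String get(String key) {
        return entries.get(key);
    }

    public int size() {
        return entries.size();
    }

    // Writes the entry count, then each key/value pair.
    public void write(DataOutput out) throws IOException {
        out.writeInt(entries.size());
        for (Map.Entry<String, String> e : entries.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // Replaces the current content with the entries read from the stream.
    public void readFields(DataInput in) throws IOException {
        entries.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            entries.put(key, value);
        }
    }

    // Convenience helpers for this example only.
    public byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static MetaDataMap fromBytes(byte[] data) {
        try {
            MetaDataMap map = new MetaDataMap();
            map.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return map;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An empty map costs only the four bytes of the entry count in this sketch, which matches the point that pages without meta data add very little to the db size.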

To illustrate the usage, I changed the WebDBInjector so that it can store the DMOZ topic in the webdb. Furthermore, I changed the index-more plugin to store all page meta data in the index as well.

So it takes less than 5 minutes to create a query-filter that supports queries for DMOZ topics like "topic:Top/Science quantum particle".
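
To make the shape of such a query concrete, the following sketch splits it into the topic clause and the remaining free-text terms. It is plain Java and does not use nutch's query classes or its query-filter extension point; the TopicQuery class is a hypothetical helper written only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of nutch: separates an optional "topic:"
// clause from the free-text terms of a query string.
class TopicQuery {
    private static final String PREFIX = "topic:";

    final String topic;        // null when the query has no topic clause
    final List<String> terms;  // remaining whitespace-separated terms

    private TopicQuery(String topic, List<String> terms) {
        this.topic = topic;
        this.terms = terms;
    }

    static TopicQuery parse(String query) {
        String topic = null;
        List<String> terms = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty query yields one empty token
            }
            if (topic == null && token.startsWith(PREFIX)) {
                // Only the first topic clause is used in this sketch.
                topic = token.substring(PREFIX.length());
            } else {
                terms.add(token);
            }
        }
        return new TopicQuery(topic, terms);
    }
}
```

For the example query, the topic would be Top/Science and the terms would be quantum and particle; a real query-filter would then match the topic against the indexed meta data field.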


Sure, the webdb size will only blow up if you actually use meta data.

Depending on the amount of meta data, performance is slower only because your hard disk needs to read and write more data. We measured that this is much faster than looking up external data.

We currently maintain our page meta data with a custom tool that reads a webdb and creates a new one with meta data. Since this is an additional step in the workflow, I would like to discuss with the developers how to add meta data maintenance, such as setting and processing meta data, to the nutch workflow.

Adding an extension point during fetching might be a good choice, since the parsed data and text are available before a page status update is done. The DbUpdate tool would later bring freshly added meta data back to the webdb. However, at this point the webdb itself is not available, so generating meta data based on the link graph would not be possible.


Any comments?


Stefan
