The problem with adding stuff to Page is that it is copied each time the
web db is updated, which means during each iteration of crawling or link
analysis.
In the future MapReduce-based version of Nutch, only crawl-related
data will be accessed at crawl time. This should make crawling much
faster, since, e.g., the link graph need no longer be maintained while
crawling.
For other types of data, it should be possible simply to supply large
flat files keyed by URL whose value is data related to the URL. One set
of such files will be the result of fetching and parsing, containing
text, title, date, etc. Another will be the result of a MapReduce pass
that inverts links found while parsing, to provide inlink anchors.
It should be simple to provide more such files with metadata. For
example, the DMOZ file could be converted to a file mapping URLs to
descriptions. A reduce operation can combine all information for each
URL for indexing.
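To make that combine step concrete, here is a rough, self-contained sketch in plain Java (deliberately not tied to any particular MapReduce API, since that code isn't written yet): everything that arrives under the same URL key, whether parse output, inverted anchors or external metadata like DMOZ descriptions, gets folded into a single record for the indexer.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class CombineForIndexing {

  /**
   * Each value is one record for this URL, e.g. {"title": ..., "anchor": ...},
   * coming from one of the flat files described above.
   */
  public static Map reduce(String url, Iterator valuesForUrl) {
    Map combined = new HashMap();
    combined.put("url", url);
    while (valuesForUrl.hasNext()) {
      Map datum = (Map) valuesForUrl.next();
      combined.putAll(datum);   // later sources may override earlier ones
    }
    return combined;            // hand this to the indexer
  }
}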
Doug
Stefan Groschupf wrote:
Hi,
meta data in the web db has been discussed a few times on this list.
At the moment it is only possible using workarounds, like storing meta data
in an external db or implementing a custom IWebDBReader / IWebDBWriter.
However, these workarounds are very slow.
Since I guess most users run Nutch for smaller-scale, special-interest
kinds of search engines, I think meta data support is a very
interesting feature.
Therefore I would love to suggest a very simple way to store meta data
in the webdb.
http://issues.apache.org/jira/browse/NUTCH-59
This patch works for our needs; however, it is just a simple additional
field in the Page object.
I created a WritableMap and added accessors to the Page object.
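Roughly the idea looks like this (just a sketch to show the shape, not the exact code in the patch; see NUTCH-59 for the real implementation):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.nutch.io.Writable;

// A map of string keys and values that is serialized together with the Page.
public class WritableMap implements Writable {
  private Map map = new HashMap();

  public void put(String key, String value) { map.put(key, value); }
  public String get(String key) { return (String) map.get(key); }

  public void write(DataOutput out) throws IOException {
    out.writeInt(map.size());
    for (Iterator i = map.entrySet().iterator(); i.hasNext();) {
      Map.Entry e = (Map.Entry) i.next();
      out.writeUTF((String) e.getKey());
      out.writeUTF((String) e.getValue());
    }
  }

  public void readFields(DataInput in) throws IOException {
    map.clear();
    int size = in.readInt();
    for (int i = 0; i < size; i++) {
      String key = in.readUTF();
      map.put(key, in.readUTF());
    }
  }
}

The accessors on Page then just expose such a map, so any tool can do something like page.getMetaData().put("topic", "Top/Science") (the method names here are only illustrative; the patch may name them differently).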
To illustrate the usage, I changed the WebDBInjector to be able to store
the DMOZ topic in the webdb.
Furthermore, I changed the index-more plugin to store all page meta data
in the index as well.
So it takes you less than 5 minutes to create a query filter that supports
queries for DMOZ topics like "topic:Top/Science quantum particle".
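Such a query filter could be as small as this (a sketch only, assuming the index-more change stores the field under the name "topic" and following the pattern of the existing raw field filters, e.g. in the query-more plugin, plus the usual plugin.xml registration):

import org.apache.nutch.searcher.RawFieldQueryFilter;

// Handles "topic:" clauses such as topic:Top/Science.
public class TopicQueryFilter extends RawFieldQueryFilter {
  public TopicQueryFilter() {
    super("topic");
  }
}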
Sure, the webdb size will blow up, but only if you actually use meta data.
But depending on the amount of meta data, performance only gets slower
because your hdd needs to read / write more data. We measured that this
is still much faster than looking up external data.
We actually maintain our page meta data using a custom tool that
reads the webdb and creates a new one with the meta data added.
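In outline the tool does something like this (simplified sketch; the meta data source and the getMetaData() accessor are only illustrative names, and links are copied into the new webdb the same way as pages):

import java.util.Enumeration;
import org.apache.nutch.db.IWebDBReader;
import org.apache.nutch.db.IWebDBWriter;
import org.apache.nutch.db.Page;

public class AddMetaDataTool {

  /** Copies every page into a fresh webdb, attaching meta data on the way. */
  public static void copyWithMetaData(IWebDBReader reader, IWebDBWriter writer,
                                      MetaDataSource source) throws Exception {
    for (Enumeration e = reader.pages(); e.hasMoreElements();) {
      Page page = (Page) e.nextElement();
      String topic = source.lookup(page.getURL().toString());
      if (topic != null) {
        page.getMetaData().put("topic", topic);  // accessor added by the patch
      }
      writer.addPage(page);
    }
  }

  /** Whatever supplies the meta data, e.g. a flat file keyed by URL. */
  public interface MetaDataSource {
    String lookup(String url);
  }
}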
Since this is an additional step in the workflow, I would love to discuss
with the developers how to add meta data maintenance, i.e. setting and
processing meta data, within the Nutch workflow.
Adding an extension point during fetching would maybe be a good choice,
since the parsed data and text are available before a page status update
is done. The DbUpdate tool would then bring the freshly added meta data
back to the web db.
However, at this point the webdb itself is not available, so generating
meta data based on the link graph wouldn't be possible.
Any Comments?
Stefan