The problem with adding stuff to Page is that it is copied each time the
web db is updated, which means during each iteration of crawling or link
analysis.
In the future MapReduce-based version of Nutch, only crawl-related
data will be accessed at crawl time. This should make crawling much
faster, since, e.g., the link graph need no longer be maintained while
crawling.
For other types of data, it should be possible simply to supply large
flat files keyed by URL whose value is data related to the URL. One set
of such files will be the result of fetching and parsing, containing
text, title, date, etc. Another will be the result of a MapReduce pass
that inverts links found while parsing, to provide inlink anchors.
It should be simple to provide more such files with metadata. For
example, the DMOZ file could be converted to a file mapping URLs to
descriptions. A reduce operation can combine all information for each
URL for indexing.
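To make that combine step concrete, here is a rough, self-contained sketch in plain Java (deliberately not tied to any particular MapReduce API, since that code isn't written yet): everything that arrives under the same URL key, whether parse output, inverted anchors or external metadata like DMOZ descriptions, gets folded into a single record for the indexer.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class CombineForIndexing {

  /**
   * Each value is one record for this URL, e.g. {"title": ..., "anchor": ...},
   * coming from one of the flat files described above.
   */
  public static Map reduce(String url, Iterator valuesForUrl) {
    Map combined = new HashMap();
    combined.put("url", url);
    while (valuesForUrl.hasNext()) {
      Map datum = (Map) valuesForUrl.next();
      combined.putAll(datum);   // later sources may override earlier ones
    }
    return combined;            // hand this to the indexer
  }
}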
Doug
Stefan Groschupf wrote:
Hi,
meta data in the web db has been discussed a few times on this list.
At the moment it is only possible using workarounds, like storing meta data
in an external db or implementing a custom IWebDBReader / IWebDBWriter.
However, these workarounds are very slow.
Since I guess most users run Nutch for smaller-scale, special-interest
kinds of search engines, I think meta data support is a very
interesting feature.
Therefore I would love to suggest a very simple way to store meta data
in the webdb.
http://issues.apache.org/jira/browse/NUTCH-59
This patch works for our needs; however, it is just a simple additional
field in the Page object.
I created a WritableMap and added accessors to the Page object.
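Roughly the idea looks like this (just a sketch to show the shape, not the exact code in the patch; see NUTCH-59 for the real implementation):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.nutch.io.Writable;

// A map of string keys and values that is serialized together with the Page.
public class WritableMap implements Writable {
  private Map map = new HashMap();

  public void put(String key, String value) { map.put(key, value); }
  public String get(String key) { return (String) map.get(key); }

  public void write(DataOutput out) throws IOException {
    out.writeInt(map.size());
    for (Iterator i = map.entrySet().iterator(); i.hasNext();) {
      Map.Entry e = (Map.Entry) i.next();
      out.writeUTF((String) e.getKey());
      out.writeUTF((String) e.getValue());
    }
  }

  public void readFields(DataInput in) throws IOException {
    map.clear();
    int size = in.readInt();
    for (int i = 0; i < size; i++) {
      String key = in.readUTF();
      map.put(key, in.readUTF());
    }
  }
}

The accessors on Page then just expose such a map, so any tool can do something like page.getMetaData().put("topic", "Top/Science") (the method names here are only illustrative; the patch may name them differently).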
To illustrate the usage, I changed the WebDBInjector to be able to store
the DMOZ topic in the webdb.
Furthermore, I changed the index-more plugin to store all page meta data
in the index as well.
So it takes you less than 5 minutes to create a query filter that supports
queries for DMOZ topics like "topic:Top/Science quantum particle".
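Such a query filter could be as small as this (a sketch only, assuming the index-more change stores the field under the name "topic" and following the pattern of the existing raw field filters, e.g. in the query-more plugin, plus the usual plugin.xml registration):

import org.apache.nutch.searcher.RawFieldQueryFilter;

// Handles "topic:" clauses such as topic:Top/Science.
public class TopicQueryFilter extends RawFieldQueryFilter {
  public TopicQueryFilter() {
    super("topic");
  }
}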
Sure, the webdb size will blow up, but only if you actually use meta data.
But depending on the amount of meta data, performance only gets slower
because your hdd needs to read / write more data. We measured that this
is still much faster than looking up external data.
We actually maintain our page meta data using a custom tool that
reads the webdb and creates a new one with the meta data added.
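In outline the tool does something like this (simplified sketch; the meta data source and the getMetaData() accessor are only illustrative names, and links are copied into the new webdb the same way as pages):

import java.util.Enumeration;
import org.apache.nutch.db.IWebDBReader;
import org.apache.nutch.db.IWebDBWriter;
import org.apache.nutch.db.Page;

public class AddMetaDataTool {

  /** Copies every page into a fresh webdb, attaching meta data on the way. */
  public static void copyWithMetaData(IWebDBReader reader, IWebDBWriter writer,
                                      MetaDataSource source) throws Exception {
    for (Enumeration e = reader.pages(); e.hasMoreElements();) {
      Page page = (Page) e.nextElement();
      String topic = source.lookup(page.getURL().toString());
      if (topic != null) {
        page.getMetaData().put("topic", topic);  // accessor added by the patch
      }
      writer.addPage(page);
    }
  }

  /** Whatever supplies the meta data, e.g. a flat file keyed by URL. */
  public interface MetaDataSource {
    String lookup(String url);
  }
}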
Since this is an additional step in the workflow, I would love to discuss
with the developers how to add meta data maintenance, i.e. setting and
processing meta data, within the Nutch workflow.
Adding an extension point during fetching would maybe be a good choice,
since the parsed data and text are available before a page status update
is done. The DbUpdate tool would then bring the freshly added meta data
back to the web db.
However, at this point the webdb itself is not available, so generating
meta data based on the link graph wouldn't be possible.
Any Comments?
Stefan