Aside from some final changes, including comments, documentation, and some
simple API changes to upgrade to the Hadoop 0.17 API, I have completed
work on a new scoring and indexing framework. The purpose of this email
is to explain the new framework and the tools / jobs that it contains.
The new scoring framework is meant to replace the OPIC scoring system
currently in Nutch. The new scoring tools can be found under the
org.apache.nutch.scoring.webgraph package. This package contains a
WebGraph job, a Loops job, a LinkRank job, a ScoreUpdater job, and tools
for dumping and reading the different databases.
To use the new scoring you would start by running the WebGraph job.
This job will take one or more segments and create an outlink database,
an inlink database containing all inlinks (no maximum number of inlinks),
and a node database that holds the number of inlinks and outlinks and a
score for each node. The web graph can be updated and takes fetch
timestamps into account when processing. Links from newer fetches
replace links from older fetches, meaning the links from a given url
should always be the most recent for that url and there should be no
holdover links that no longer exist. This ensures that as the link
structure of a web page changes, the web graph changes to accommodate it,
and those changes will be reflected in the later link analysis scores.
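
To make the update rule concrete, here is a rough sketch of the
newest-fetch-wins behavior. The Fetch class and field names below are
mine for illustration, not the actual webgraph code:

  import java.util.HashMap;
  import java.util.Map;

  /** Illustration only: for each source url, links from the newest
   *  fetch replace any links from older fetches. */
  public class LinkMerge {

    static class Fetch {
      final String fromUrl;
      final long fetchTime;      // timestamp of the fetch
      final String[] outlinks;   // links seen in this fetch
      Fetch(String fromUrl, long fetchTime, String[] outlinks) {
        this.fromUrl = fromUrl;
        this.fetchTime = fetchTime;
        this.outlinks = outlinks;
      }
    }

    /** Keep only the outlinks from the most recent fetch of each url. */
    static Map<String, String[]> mergeByNewest(Iterable<Fetch> fetches) {
      Map<String, Long> newest = new HashMap<String, Long>();
      Map<String, String[]> links = new HashMap<String, String[]>();
      for (Fetch f : fetches) {
        Long seen = newest.get(f.fromUrl);
        if (seen == null || f.fetchTime > seen) {
          newest.put(f.fromUrl, f.fetchTime);
          links.put(f.fromUrl, f.outlinks); // newer fetch wins
        }
      }
      return links;
    }
  }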
Once the WebGraph job is run, the Loops job can be run. This job takes
the web graph and walks outlinks in an attempt to find link cycles in
the graph. This job is computationally expensive and beyond a depth of
two loops requires a great deal of space. Because of this it is
optional, but it can help in identifying spam pages and link farms, and
if it is run its output will be used inside the link analysis tool.
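
The real Loops job does this over a series of MapReduce passes, but the
basic idea of walking outlinks to find cycles can be sketched as a
depth-limited walk. Everything below is a hypothetical illustration,
and the frontier growth in it is exactly why the real job gets
expensive past a depth of two:

  import java.util.ArrayDeque;
  import java.util.Deque;
  import java.util.List;
  import java.util.Map;

  /** Illustration only: detects whether a url sits on a link cycle by
   *  walking outlinks to a fixed depth. */
  public class CycleSketch {

    static boolean inCycle(String start, Map<String, List<String>> outlinks,
        int maxDepth) {
      Deque<String> frontier = new ArrayDeque<String>();
      frontier.push(start);
      for (int depth = 0; depth < maxDepth; depth++) {
        Deque<String> next = new ArrayDeque<String>();
        for (String url : frontier) {
          List<String> outs = outlinks.get(url);
          if (outs == null) continue;
          for (String out : outs) {
            if (out.equals(start)) return true; // walked back to start
            next.push(out);
          }
        }
        // each pass expands the walk one link further; the frontier can
        // grow multiplicatively, which is why space blows up past depth 2
        frontier = next;
      }
      return false;
    }
  }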
The next job is the LinkRank job. This is a link analysis tool, similar
to PageRank, that creates a stable score for a page based on the page's
inlinks and their recursive scores. LinkRank runs through an iterative
cycle until the link scores converge. The scores are stored in the node
database inside the web graph.
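
As a rough illustration of the shape of that cycle, here is one
iteration of a PageRank-style computation. The formula and damping
constant are standard PageRank, shown only as an analogy; LinkRank's
exact math may differ:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  /** Illustration only: one PageRank-style iteration. Assumes every
   *  inlinking page has an entry in outDegree. */
  public class RankSketch {

    static Map<String, Double> iterate(Map<String, Double> scores,
        Map<String, List<String>> inlinks, Map<String, Integer> outDegree,
        double damping) {
      Map<String, Double> next = new HashMap<String, Double>();
      for (String url : scores.keySet()) {
        double sum = 0.0;
        List<String> ins = inlinks.get(url);
        if (ins != null) {
          for (String in : ins) {
            // each parent passes on its score split across its outlinks
            sum += scores.get(in) / outDegree.get(in);
          }
        }
        next.put(url, (1.0 - damping) + damping * sum);
      }
      return next; // repeat until scores stop changing between iterations
    }
  }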
Once the LinkRank job is run, you can use the ScoreUpdater to update a
crawl database with the link score for each url. This allows the
generator to produce better topN lists of pages to fetch. In the older
indexing methods the crawl database scores were used as an element of
the document boost in the index. The new framework includes a
scoring-link plugin that allows it to work with the older indexer. That
plugin takes the score from the crawl database, after the ScoreUpdater
is run, and uses it as the document boost for the url. The new indexing
framework does not use the crawl database and takes its scores directly
from the node database in the web graph. The ScoreUpdater still needs
to be run, though, so that the generator works more efficiently.
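
To summarize where the boost comes from in each case, a hypothetical
sketch (the interfaces are mine, not Nutch APIs):

  /** Illustration only: where the document boost comes from in the two
   *  indexing paths. */
  public class BoostSource {

    interface CrawlDb { float getScore(String url); } // set by ScoreUpdater
    interface NodeDb  { float getScore(String url); } // inside the webgraph

    /** Older indexer + scoring-link plugin: boost from the crawl database. */
    static float oldIndexerBoost(CrawlDb crawlDb, String url) {
      return crawlDb.getScore(url);
    }

    /** New field-based indexer: boost read directly from the node database. */
    static float newIndexerBoost(NodeDb nodeDb, String url) {
      return nodeDb.getScore(url);
    }
  }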
The other jobs in the scoring package are LinkDumper, LoopReader,
NodeDumper, and NodeReader. LinkDumper creates a database that makes it
easy to show the inlinks and outlinks of a given page. There is a
maximum number of inlinks that can be stored and displayed per url in
this database. LoopReader shows the link cycles for a given url; it can
only be used if the Loops job was run. NodeDumper creates a text file
showing the top urls by number of inlinks, number of outlinks, or link
score. There are different options for each, and each must be run
separately. This is useful for seeing the top ranking urls. NodeDumper
can be run after LinkRank when looking at scores, or after WebGraph when
looking at inlinks or outlinks. NodeReader prints out a single url from
the node database.
To recap, scoring would now be run in the following order: Inject,
Generate, Fetch and Parse, WebGraph, Loops (optional), LinkRank,
ScoreUpdater, then indexing.
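
For anyone driving the jobs from Java rather than the command line, the
sequence would look roughly like the sketch below. I am assuming each
job implements Hadoop's Tool interface and takes the flags shown; check
each job's usage output before relying on them, and treat the paths as
examples only:

  import org.apache.hadoop.util.ToolRunner;
  import org.apache.nutch.scoring.webgraph.LinkRank;
  import org.apache.nutch.scoring.webgraph.Loops;
  import org.apache.nutch.scoring.webgraph.ScoreUpdater;
  import org.apache.nutch.scoring.webgraph.WebGraph;
  import org.apache.nutch.util.NutchConfiguration;

  public class ScoringPipeline {
    public static void main(String[] args) throws Exception {
      // example paths only
      String webgraphdb = "crawl/webgraphdb";
      String crawldb = "crawl/crawldb";
      String segment = "crawl/segments/20080101000000";

      ToolRunner.run(NutchConfiguration.create(), new WebGraph(),
          new String[] { "-webgraphdb", webgraphdb, "-segment", segment });
      ToolRunner.run(NutchConfiguration.create(), new Loops(), // optional
          new String[] { "-webgraphdb", webgraphdb });
      ToolRunner.run(NutchConfiguration.create(), new LinkRank(),
          new String[] { "-webgraphdb", webgraphdb });
      ToolRunner.run(NutchConfiguration.create(), new ScoreUpdater(),
          new String[] { "-crawldb", crawldb, "-webgraphdb", webgraphdb });
    }
  }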
The next piece is the new indexing framework and it can be found in the
org.apache.nutch.indexer.field package. The current indexer is somewhat
limited in what databases can be passed in to the indexing process.
This new indexer removes that limitation and gives more granular control
over the fields, their boosts, how they are indexed and the document
boosts included in an index. The new indexing process consists of two
phases: first, taking content or analysis output and creating
FieldWritable objects; second, aggregating those FieldWritable objects
and indexing them. There are three jobs in the field package:
BasicFields, AnchorFields, and FieldIndexer.
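
To give a feel for what flows between the two phases, here is a
hypothetical stand-in for a FieldWritable. The real class's exact
fields may differ, and it also implements Hadoop's Writable so it can
move between jobs:

  /** Illustration only: everything the indexer needs to know about one
   *  field of one document. */
  public class FieldSketch {
    String name;     // e.g. "title", "anchor", or the special Field.DOC_BOOST
    String value;    // the field text
    float boost;     // per-field boost applied at index time
    boolean indexed; // searchable?
    boolean stored;  // retrievable from the index?

    FieldSketch(String name, String value, float boost, boolean indexed,
        boolean stored) {
      this.name = name;
      this.value = value;
      this.boost = boost;
      this.indexed = indexed;
      this.stored = stored;
    }
  }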
The BasicFields job replaces the current indexer and the index-basic
indexing plugin. This job will take one or more segments, find the most
recently fetched segment for a given url, and create the appropriate
fields for the index. In doing so, BasicFields removes a common form of
duplicate inside an index: the same url fetched through multiple
redirects.
BasicFields also handles a unique form of representative url logic. If
a url has a representative url due to redirects, and LinkRank has been
run, the BasicFields job will compare the url and the representative url
by link rank score. The one with the higher score will be kept as the
url shown in the index; the other becomes the orig url in the index.
This is useful because an index will usually want to display the highest
scoring url (i.e. www.google.com instead of google.com) for a given
webpage.
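
The representative url decision reduces to a simple comparison. A
hypothetical sketch:

  /** Illustration only: pick the higher scoring of a url and its
   *  representative url as the one to display in the index. */
  public class RepUrlSketch {

    /** Returns { urlToShow, origUrl }. */
    static String[] choose(String url, double urlScore, String repUrl,
        double repScore) {
      if (repScore > urlScore) {
        return new String[] { repUrl, url };
      }
      return new String[] { url, repUrl };
    }
  }

So if www.google.com outscores google.com, choose() returns
www.google.com as the url to display and google.com as the orig url.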
AnchorFields replaces the current index-anchor plugin for Nutch and
extends it to allow Nutch to index the best inlink anchors for a given
url. In AnchorFields, the inlinks to a url are inverted and scored with
their parent page's link score. The anchor text of the highest scoring
parents is then converted into FieldWritables to be indexed. Those
anchors are also indexed with a FieldBoost equivalent to their parent
page's link score. The idea behind this is that higher scoring pages
will have better outlinks, and the text of those links will be more
relevant to search.
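
A rough sketch of that selection, with hypothetical names:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  /** Illustration only: invert inlinks, score each anchor with its
   *  parent page's link score, and keep the anchors of the top
   *  scoring parents. */
  public class AnchorSketch {

    static class Inlink {
      final String anchorText;
      final double parentScore; // link score of the page holding the link
      Inlink(String anchorText, double parentScore) {
        this.anchorText = anchorText;
        this.parentScore = parentScore;
      }
    }

    static List<String> bestAnchors(List<Inlink> inlinks, int maxAnchors) {
      Collections.sort(inlinks, new Comparator<Inlink>() {
        public int compare(Inlink a, Inlink b) {
          return Double.compare(b.parentScore, a.parentScore);
        }
      });
      List<String> best = new ArrayList<String>();
      int n = Math.min(maxAnchors, inlinks.size());
      for (Inlink in : inlinks.subList(0, n)) {
        best.add(in.anchorText); // indexed with a boost of parentScore
      }
      return best;
    }
  }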
BasicFields and AnchorFields both create databases. Those databases are
then passed into the FieldIndexer. The FieldIndexer is responsible
simply for taking FieldWritable objects and turning them into a Lucene
index. It is also responsible for taking special FieldWritable objects
with the name Field.DOC_BOOST and aggregating them together to create a
final document boost for a given url. This allows multiple types of
analysis to be run before indexing and their results aggregated to
create a final document score in the index. All doc boost fields are
stored showing their contribution to the final document score. An
example of a doc boost is the link rank score. The BasicFields job is
what takes the link rank and creates a FieldWritable with the doc boost
name.
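
A hypothetical sketch of that aggregation, reusing the FieldSketch
class from the earlier sketch (summing the contributions is my
assumption for illustration):

  import java.util.List;

  /** Illustration only: aggregate all DOC_BOOST fields for a url into
   *  a final document boost. */
  public class DocBoostSketch {

    static final String DOC_BOOST = "doc-boost"; // stand-in for Field.DOC_BOOST

    static float finalBoost(List<FieldSketch> fields) {
      float boost = 0.0f;
      for (FieldSketch f : fields) {
        if (DOC_BOOST.equals(f.name)) {
          boost += f.boost; // each analysis contributes; each is also stored
        }
      }
      return boost;
    }
  }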
The order for running the new indexing jobs would be BasicFields,
AnchorFields, and FieldIndexer. These would need to be run after the
new Scoring logic.
So there it is, a new Scoring framework and a new Indexing framework. I
believe these two pieces contribute significantly to improving the
relevancy in the current Nutch system. These two pieces are currently
in Jira as NUTCH-635. I hope to finish up comments, documentation, and
other small changes within the next few days and move this into the
nutch core. If anybody has any questions or comments, feel free.
Dennis