Aside from some final changes, including comments, documentation, and some
simple API changes to upgrade to the Hadoop 0.17 API, I have completed
work on a new scoring and indexing framework. The purpose of this email
is to explain the new framework and the tools / jobs that it contains.
The new scoring framework is meant to replace the OPIC scoring system
currently in Nutch. The new scoring tools can be found under the
org.apache.nutch.scoring.webgraph package. This package contains a
WebGraph job, a Loops job, a LinkRank job, a ScoreUpdater job, and tools
for dumping and reading the different databases.
To use the new scoring you would start by running the WebGraph job.
This job will take one or more segments and create an outlink database,
an inlink database containing all inlinks (no maximum number of inlinks),
and a node database that holds the number of inlinks and outlinks and a
score for each node. The web graph can be updated and takes fetch
timestamps into account when processing. Links from newer fetches
replace links from older fetches, meaning the links from a given url
should always be the most recent for that url and there should be no
holdover links that no longer exist. This ensures that as the link
structure of a web page changes, the web graph changes to accommodate it,
and those changes will be reflected in the later link analysis scores.
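
To make the update rule concrete, here is a rough sketch of the
newest-fetch-wins behavior. The Fetch class and field names below are
mine for illustration, not the actual webgraph code:

  import java.util.HashMap;
  import java.util.Map;

  /** Illustration only: for each source url, links from the newest
   *  fetch replace any links from older fetches. */
  public class LinkMerge {

    static class Fetch {
      final String fromUrl;
      final long fetchTime;      // timestamp of the fetch
      final String[] outlinks;   // links seen in this fetch
      Fetch(String fromUrl, long fetchTime, String[] outlinks) {
        this.fromUrl = fromUrl;
        this.fetchTime = fetchTime;
        this.outlinks = outlinks;
      }
    }

    /** Keep only the outlinks from the most recent fetch of each url. */
    static Map<String, String[]> mergeByNewest(Iterable<Fetch> fetches) {
      Map<String, Long> newest = new HashMap<String, Long>();
      Map<String, String[]> links = new HashMap<String, String[]>();
      for (Fetch f : fetches) {
        Long seen = newest.get(f.fromUrl);
        if (seen == null || f.fetchTime > seen) {
          newest.put(f.fromUrl, f.fetchTime);
          links.put(f.fromUrl, f.outlinks); // newer fetch wins
        }
      }
      return links;
    }
  }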
Once the WebGraph job is run, the Loops job can be run. This job takes
the web graph and walks outlinks in an attempt to find link cycles in
the graph. This job is computationally expensive and beyond a depth of
two loops requires a great deal of space. Because of this it is
optional, but it can help in identifying spam pages and link farms, and
if it is run its output will be used inside the link analysis tool.
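
The real Loops job does this over a series of MapReduce passes, but the
basic idea of walking outlinks to find cycles can be sketched as a
depth-limited walk. Everything below is a hypothetical illustration,
and the frontier growth in it is exactly why the real job gets
expensive past a depth of two:

  import java.util.ArrayDeque;
  import java.util.Deque;
  import java.util.List;
  import java.util.Map;

  /** Illustration only: detects whether a url sits on a link cycle by
   *  walking outlinks to a fixed depth. */
  public class CycleSketch {

    static boolean inCycle(String start, Map<String, List<String>> outlinks,
        int maxDepth) {
      Deque<String> frontier = new ArrayDeque<String>();
      frontier.push(start);
      for (int depth = 0; depth < maxDepth; depth++) {
        Deque<String> next = new ArrayDeque<String>();
        for (String url : frontier) {
          List<String> outs = outlinks.get(url);
          if (outs == null) continue;
          for (String out : outs) {
            if (out.equals(start)) return true; // walked back to start
            next.push(out);
          }
        }
        // each pass expands the walk one link further; the frontier can
        // grow multiplicatively, which is why space blows up past depth 2
        frontier = next;
      }
      return false;
    }
  }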
The next job is the LinkRank job. This is a link analysis tool, similar
to PageRank, that creates a stable score for a page based on the page's
inlinks and their recursive scores. LinkRank runs through an iterative
cycle until the link scores converge. The scores are stored in the node
database inside the web graph.
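
As a rough illustration of the shape of that cycle, here is one
iteration of a PageRank-style computation. The formula and damping
constant are standard PageRank, shown only as an analogy; LinkRank's
exact math may differ:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  /** Illustration only: one PageRank-style iteration. Assumes every
   *  inlinking page has an entry in outDegree. */
  public class RankSketch {

    static Map<String, Double> iterate(Map<String, Double> scores,
        Map<String, List<String>> inlinks, Map<String, Integer> outDegree,
        double damping) {
      Map<String, Double> next = new HashMap<String, Double>();
      for (String url : scores.keySet()) {
        double sum = 0.0;
        List<String> ins = inlinks.get(url);
        if (ins != null) {
          for (String in : ins) {
            // each parent passes on its score split across its outlinks
            sum += scores.get(in) / outDegree.get(in);
          }
        }
        next.put(url, (1.0 - damping) + damping * sum);
      }
      return next; // repeat until scores stop changing between iterations
    }
  }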
Once the LinkRank job is run, you can use the ScoreUpdater to update a
crawl database with the link score for each url. This allows the
generator to produce better topN lists of pages to fetch. In the older
indexing methods the crawl database scores were used as an element of
the document boost in the index. The new framework includes a
scoring-link plugin that allows it to work with the older indexer. That
plugin takes the score from the crawl database, after the ScoreUpdater
is run, and uses it as the document boost for the url. The new indexing
framework does not use the crawl database and takes its scores directly
from the node database in the web graph. The ScoreUpdater still needs
to be run, though, so that the generator works more efficiently.
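
To summarize where the boost comes from in each case, a hypothetical
sketch (the interfaces are mine, not Nutch APIs):

  /** Illustration only: where the document boost comes from in the two
   *  indexing paths. */
  public class BoostSource {

    interface CrawlDb { float getScore(String url); } // set by ScoreUpdater
    interface NodeDb  { float getScore(String url); } // inside the webgraph

    /** Older indexer + scoring-link plugin: boost from the crawl database. */
    static float oldIndexerBoost(CrawlDb crawlDb, String url) {
      return crawlDb.getScore(url);
    }

    /** New field-based indexer: boost read directly from the node database. */
    static float newIndexerBoost(NodeDb nodeDb, String url) {
      return nodeDb.getScore(url);
    }
  }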
The other jobs in the scoring package are LinkDumper, LoopReader,
NodeDumper, and NodeReader. LinkDumper creates a database that makes it
easy to show the inlinks and outlinks of a given page. There is a
maximum number of inlinks that can be stored and displayed per url in
this database. LoopReader shows the link cycles for a given url; it can
only be used if the Loops job was run. NodeDumper creates a text file
showing the top urls by number of inlinks, number of outlinks, or link
score. There are different options for each, and each must be run
separately. This is useful for seeing the top ranking urls. NodeDumper
can be run after LinkRank when looking at scores, or after WebGraph when
looking at inlinks or outlinks. NodeReader prints out a single url from
the node database.
To recap, scoring would now be run in the following order: Inject,
Generate, Fetch and Parse, WebGraph, Loops (optional), LinkRank,
ScoreUpdater, then indexing.
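
For anyone driving the jobs from Java rather than the command line, the
sequence would look roughly like the sketch below. I am assuming each
job implements Hadoop's Tool interface and takes the flags shown; check
each job's usage output before relying on them, and treat the paths as
examples only:

  import org.apache.hadoop.util.ToolRunner;
  import org.apache.nutch.scoring.webgraph.LinkRank;
  import org.apache.nutch.scoring.webgraph.Loops;
  import org.apache.nutch.scoring.webgraph.ScoreUpdater;
  import org.apache.nutch.scoring.webgraph.WebGraph;
  import org.apache.nutch.util.NutchConfiguration;

  public class ScoringPipeline {
    public static void main(String[] args) throws Exception {
      // example paths only
      String webgraphdb = "crawl/webgraphdb";
      String crawldb = "crawl/crawldb";
      String segment = "crawl/segments/20080101000000";

      ToolRunner.run(NutchConfiguration.create(), new WebGraph(),
          new String[] { "-webgraphdb", webgraphdb, "-segment", segment });
      ToolRunner.run(NutchConfiguration.create(), new Loops(), // optional
          new String[] { "-webgraphdb", webgraphdb });
      ToolRunner.run(NutchConfiguration.create(), new LinkRank(),
          new String[] { "-webgraphdb", webgraphdb });
      ToolRunner.run(NutchConfiguration.create(), new ScoreUpdater(),
          new String[] { "-crawldb", crawldb, "-webgraphdb", webgraphdb });
    }
  }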
The next piece is the new indexing framework and it can be found in the
org.apache.nutch.indexer.field package. The current indexer is somewhat
limited in what databases can be passed in to the indexing process.
This new indexer removes that limitation and gives more granular control
over the fields, their boosts, how they are indexed and the document
boosts included in an index. The new indexing process consists of two
phases: first, taking content or analysis output and creating
FieldWritable objects; second, aggregating those FieldWritable objects
and indexing them. There are three jobs in the field package:
BasicFields, AnchorFields, and FieldIndexer.
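
To give a feel for what flows between the two phases, here is a
hypothetical stand-in for a FieldWritable. The real class's exact
fields may differ, and it also implements Hadoop's Writable so it can
move between jobs:

  /** Illustration only: everything the indexer needs to know about one
   *  field of one document. */
  public class FieldSketch {
    String name;     // e.g. "title", "anchor", or the special Field.DOC_BOOST
    String value;    // the field text
    float boost;     // per-field boost applied at index time
    boolean indexed; // searchable?
    boolean stored;  // retrievable from the index?

    FieldSketch(String name, String value, float boost, boolean indexed,
        boolean stored) {
      this.name = name;
      this.value = value;
      this.boost = boost;
      this.indexed = indexed;
      this.stored = stored;
    }
  }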
The BasicFields job replaces the current indexer and the index-basic
indexing plugin. This job will take one or more segments, find the most
recently fetched segment for a given url, and create the appropriate
fields for the index. In doing so, BasicFields removes a common form of
duplicate inside an index: the same url fetched through multiple
redirects.
BasicFields also handles a unique form of representative url logic. If
a url has a representative url due to redirects, and LinkRank has been
run, the BasicFields job will compare the url and the representative url
by link rank score. The one with the higher score will be kept as the
url shown in the index; the other becomes the orig url in the index.
This is useful because an index will usually want to display the highest
scoring url (i.e. www.google.com instead of google.com) for a given
webpage.
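
The representative url decision reduces to a simple comparison. A
hypothetical sketch:

  /** Illustration only: pick the higher scoring of a url and its
   *  representative url as the one to display in the index. */
  public class RepUrlSketch {

    /** Returns { urlToShow, origUrl }. */
    static String[] choose(String url, double urlScore, String repUrl,
        double repScore) {
      if (repScore > urlScore) {
        return new String[] { repUrl, url };
      }
      return new String[] { url, repUrl };
    }
  }

So if www.google.com outscores google.com, choose() returns
www.google.com as the url to display and google.com as the orig url.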
AnchorFields replaces the current index-anchor plugin for Nutch and
extends it to allow Nutch to index the best inlink anchors for a given
url. In AnchorFields, the inlinks to a url are inverted and scored with
their parent page's link score. The anchor text of the highest scoring
parents is then converted into FieldWritables to be indexed. Those
anchors are also indexed with a FieldBoost equivalent to their parent
page's link score. The idea behind this is that higher scoring pages
will have better outlinks, and the text of those links will be more
relevant to search.
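
A rough sketch of that selection, with hypothetical names:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  /** Illustration only: invert inlinks, score each anchor with its
   *  parent page's link score, and keep the anchors of the top
   *  scoring parents. */
  public class AnchorSketch {

    static class Inlink {
      final String anchorText;
      final double parentScore; // link score of the page holding the link
      Inlink(String anchorText, double parentScore) {
        this.anchorText = anchorText;
        this.parentScore = parentScore;
      }
    }

    static List<String> bestAnchors(List<Inlink> inlinks, int maxAnchors) {
      Collections.sort(inlinks, new Comparator<Inlink>() {
        public int compare(Inlink a, Inlink b) {
          return Double.compare(b.parentScore, a.parentScore);
        }
      });
      List<String> best = new ArrayList<String>();
      int n = Math.min(maxAnchors, inlinks.size());
      for (Inlink in : inlinks.subList(0, n)) {
        best.add(in.anchorText); // indexed with a boost of parentScore
      }
      return best;
    }
  }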
BasicFields and AnchorFields both create databases. Those databases are
then passed into the FieldIndexer. The FieldIndexer is responsible
simply for taking FieldWritable objects and turning them into a Lucene
index. It is also responsible for taking special FieldWritable objects
with the name Field.DOC_BOOST and aggregating them together to create a
final document boost for a given url. This allows multiple types of
analysis to be run before indexing and their results aggregated to
create a final document score in the index. All doc boost fields are
stored showing their contribution to the final document score. An
example of a doc boost is the link rank score. The BasicFields job is
what takes the link rank and creates a FieldWritable with the doc boost
name.
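
A hypothetical sketch of that aggregation, reusing the FieldSketch
class from the earlier sketch (summing the contributions is my
assumption for illustration):

  import java.util.List;

  /** Illustration only: aggregate all DOC_BOOST fields for a url into
   *  a final document boost. */
  public class DocBoostSketch {

    static final String DOC_BOOST = "doc-boost"; // stand-in for Field.DOC_BOOST

    static float finalBoost(List<FieldSketch> fields) {
      float boost = 0.0f;
      for (FieldSketch f : fields) {
        if (DOC_BOOST.equals(f.name)) {
          boost += f.boost; // each analysis contributes; each is also stored
        }
      }
      return boost;
    }
  }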
The order for running the new indexing jobs would be BasicFields,
AnchorFields, and FieldIndexer. These would need to be run after the
new Scoring logic.
So there it is, a new Scoring framework and a new Indexing framework. I
believe these two pieces contribute significantly to improving the
relevancy in the current Nutch system. These two pieces are currently
in Jira as NUTCH-635. I hope to finish up comments, documentation, and
other small changes within the next few days and move this into the
nutch core. If anybody has any questions or comments, feel free.
Dennis