MapReduce in Nutch
Doug Cutting
10 March, 2005
DRAFT
An outline of how MapReduce might be used to implement Nutch.
Note: link analysis is not yet discussed.
DATA STRUCTURES
All data structures are NDFS directories containing <key, value> files.
1. Segment
A segment is a set of URLs and associated information, stored
together in a directory. Each sub-directory (except the index)
contains a flat file of records keyed by url. Subdirectories
include (a sketch of the record types follows the list):
fetchIn: <url, <lastContentHash, lastFetchDate> >
fetchOut: <url, <status, fetchDate, contentHash> >
content: <url, <contentType, bytes, metaData> >
text: <url, <field, text>* >
outlinks: <url, <anchor, url>* >
inlinks: <url, <anchor>* >
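For concreteness, the per-url entries in these files might be modeled
roughly as follows. This is a minimal sketch in modern Java; the type
names are illustrative, not the actual Nutch classes.

  // Illustrative record types mirroring the schemas above -- only the
  // shape of the per-url entries, not the real Nutch Writable classes.
  import java.util.List;
  import java.util.Map;

  record FetchIn(String url, String lastContentHash, long lastFetchDate) {}
  record FetchOut(String url, int status, long fetchDate, String contentHash) {}
  record Content(String url, String contentType, byte[] bytes, Map<String, String> metaData) {}
  record FieldText(String field, String text) {}
  record TextEntry(String url, List<FieldText> fields) {}
  record Link(String anchor, String toUrl) {}
  record Outlinks(String url, List<Link> links) {}
  record Inlinks(String url, List<String> anchors) {}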
2. PageDB
The PageDB is used to crawl. Initially it is empty. Each round of
fetching updates the list of known urls and their status. The
database is a directory of flat files.
pages: <url, <status, contentHash, lastFetchDate, numFailures> >
3. LinkDB
The LinkDB is used to compute anchor text and link analysis
scores. It contains an annotated link graph.
linksByDest: <<destSegment, destUrl>, <srcUrl, anchor>* >
Example:
While crawling, a system might have the following files:
segment/0/fetchIn/part-0
/fetchOut/part-0
/part-1
/part-2
/content/part-0
/part-1
/part-2
segment/1/fetchIn/part-0
/part-1
/part-2
/fetchOut/part-0
/part-1
/part-2
/content/part-0
/part-1
/part-2
pageDB/pages/part-0
/part-1
/part-2
PROCESSING STEPS
0. Bootstrap
Create segment/0/fetchIn from a flat file of urls.
1. Fetch
a. MapReduce: segment/N/fetchIn -> segment/N/{0-M}/fetchIn
Split urls by domain, so that fetching can be polite. Map is
identity; partition is hash of url's domain; reduce sorts by hash
of URL, to evenly distribute urls from a domain in the file.
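A sketch of the partition and sort-key logic in plain Java, independent
of any particular MapReduce API; the class and method names are
illustrative:

  import java.net.URI;

  class FetchListPartitioner {
    // Partition by the url's host, so that all urls from one domain land
    // in the same fetch partition and the fetcher can stay polite.
    static int partition(String url, int numPartitions) {
      String host = null;
      try {
        host = new URI(url).getHost();
      } catch (Exception e) {
        // fall through and hash the whole url
      }
      if (host == null) host = url;
      return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Sort within each partition by the hash of the full url, spreading a
    // domain's urls evenly through the file instead of in one long run.
    static int sortKey(String url) {
      return url.hashCode();
    }
  }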
b. MapReduce: segment/N/M/fetchIn -> segment/N/{fetchOut,content}
Map is a multithreaded fetcher; partition by URL; reduce sorts by
URL so that output is back in canonical form.
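A rough sketch of the multithreaded fetch map using only the JDK. The
real fetcher goes through Nutch's protocol plugins and robots.txt
handling; this only shows the shape of the step.

  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.Callable;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;

  class FetcherSketch {
    record Fetched(String url, int status, long fetchDate, byte[] content) {}

    // "Map" over one partition's fetchIn with a fixed pool of fetcher
    // threads; the framework would write the results out as the segment's
    // fetchOut and content files.
    static List<Fetched> fetchAll(List<String> urls, int threads) throws Exception {
      ExecutorService pool = Executors.newFixedThreadPool(threads);
      List<Future<Fetched>> futures = new ArrayList<>();
      for (String url : urls) {
        Callable<Fetched> task = () -> fetchOne(url);
        futures.add(pool.submit(task));
      }
      List<Fetched> results = new ArrayList<>();
      for (Future<Fetched> f : futures) results.add(f.get());
      pool.shutdown();
      return results;
    }

    static Fetched fetchOne(String url) {
      long now = System.currentTimeMillis();
      try {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        byte[] body = conn.getInputStream().readAllBytes();
        return new Fetched(url, conn.getResponseCode(), now, body);
      } catch (Exception e) {
        return new Fetched(url, -1, now, new byte[0]);  // maps to FETCH_FAIL_*
      }
    }
  }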
2. Parse
a. MapReduce: segment/N/content -> segment/N/{text,outlinks}
Parse HTML, PDF, etc., extracting text and links.
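A toy illustration of the parse map. The real step dispatches to
content-type-specific parser plugins; the regex below is only a
stand-in for a proper HTML parser.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  class ParseSketch {
    record Link(String anchor, String toUrl) {}
    record Parsed(String text, List<Link> outlinks) {}

    private static final Pattern A_TAG = Pattern.compile(
        "<a\\s[^>]*href=[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // map: <url, content> -> <url, text> and <url, outlinks>
    static Parsed parse(String html) {
      List<Link> outlinks = new ArrayList<>();
      Matcher m = A_TAG.matcher(html);
      while (m.find()) {
        String anchor = m.group(2).replaceAll("<[^>]+>", " ").trim();
        outlinks.add(new Link(anchor, m.group(1)));
      }
      String text = html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
      return new Parsed(text, outlinks);
    }
  }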
3. Update PageDB
a. MapReduce: segment/N/fetchOut -> segment/N/fetch.pages
Create a mini PageDB for the segment's pages. Status is set to
one of FETCH_SUCCESS, FETCH_FAIL_TEMPORARY, FETCH_FAIL_PERMANENT.
b. MapReduce: segment/N/outlinks -> segment/N/outlink.pages
Create a mini PageDB for the pages linked from the segment. The
reduce method can combine outlinks to the same URL, substantially
reducing the work for the next step. Output is sorted by the
target urls. Status is set to LINK.
This is where new urls enter the system. Urls may be filtered or
rewritten based on configurable plugins. By default some simple
regex rules are used to filter out, e.g., urls with query strings
or with image file type suffixes.
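The default filtering could look roughly like this; the patterns are
illustrative, and the real rules live in a configurable plugin:

  import java.util.regex.Pattern;

  class UrlFilterSketch {
    // Illustrative defaults: drop urls with query strings or image suffixes.
    private static final Pattern QUERY  = Pattern.compile("[?&=]");
    private static final Pattern IMAGES =
        Pattern.compile("(?i)\\.(gif|jpg|jpeg|png|bmp|ico)$");

    // Returns the (possibly rewritten) url, or null to filter it out.
    static String filter(String url) {
      if (QUERY.matcher(url).find()) return null;
      if (IMAGES.matcher(url).find()) return null;
      return url;
    }
  }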
c. MapReduce: { PageDB/pages, segment/N/*.pages } -> PageDB/pages.new
Construct a new version of the PageDB with the updated status for
each page. Reduce combines the information for each URL and either
assigns a status of DB_FETCHED or DB_UNFETCHED, or eliminates the
page. Status
codes can be prioritized as follows:
FETCH_* > LINK > DB_*
The highest priority status determines the output status. Thus a
FETCH_SUCCESS status beats all other status codes and sets an
output status of DB_FETCHED. If a FETCH_FAIL_PERMANENT is seen or
if FETCH_FAIL_TEMPORARY is seen and the retry count has been
exceeded, then the url is omitted from the output. Otherwise the
output status is DB_UNFETCHED, perhaps with an incremented failure
count.
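A sketch of that reduce-side merge. The status names are from above;
MAX_RETRIES and the handling of pre-existing DB_* records are
illustrative assumptions.

  import java.util.Comparator;
  import java.util.List;

  class StatusMergeSketch {
    // Declared highest priority first, so enum order doubles as priority:
    // FETCH_* > LINK > DB_*.
    enum Status { FETCH_SUCCESS, FETCH_FAIL_PERMANENT, FETCH_FAIL_TEMPORARY,
                  LINK, DB_FETCHED, DB_UNFETCHED }

    record Page(Status status, int numFailures) {}

    static final int MAX_RETRIES = 3;  // illustrative retry limit

    // reduce: all records collected for one url -> the new PageDB record,
    // or null to drop the url from the database.
    static Page merge(List<Page> records) {
      Page best = records.stream()
          .min(Comparator.comparing(Page::status))
          .orElseThrow();
      switch (best.status()) {
        case FETCH_SUCCESS:
          return new Page(Status.DB_FETCHED, 0);
        case FETCH_FAIL_PERMANENT:
          return null;  // permanently failed urls are dropped
        case FETCH_FAIL_TEMPORARY: {
          // in practice the retry count would come from the existing DB record
          int failures = best.numFailures() + 1;
          return failures > MAX_RETRIES ? null : new Page(Status.DB_UNFETCHED, failures);
        }
        case LINK:
          return new Page(Status.DB_UNFETCHED, 0);  // newly discovered url
        default:
          return best;  // existing DB_* record carries over unchanged
      }
    }
  }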
d. rm PageDB/pages; mv PageDB/pages.new PageDB/pages
Install the new version of the DB.
4. Generate new segment
a. MapReduce: PageDB/pages -> segment/N/fetchIn
Initialize a new segment with a fetchIn generated from the
PageDB. Various criteria can be used, such as limiting URLs by
domain, incorporating link analysis to prioritize urls, or
refetching high-priority pages. In the simplest case this just emits all of
the DB_UNFETCHED pages.
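A sketch of the simple case plus the per-domain cap mentioned above;
maxPerHost is an illustrative parameter, and a real generator would run
as a MapReduce over PageDB/pages:

  import java.net.URI;
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  class GeneratorSketch {
    enum Status { DB_FETCHED, DB_UNFETCHED }
    record Page(String url, Status status) {}

    // Select DB_UNFETCHED pages, optionally capping the number of urls
    // taken from any one host.
    static List<String> generate(List<Page> pages, int maxPerHost) {
      Map<String, Integer> perHost = new HashMap<>();
      List<String> fetchIn = new ArrayList<>();
      for (Page p : pages) {
        if (p.status() != Status.DB_UNFETCHED) continue;
        String host = null;
        try { host = new URI(p.url()).getHost(); } catch (Exception e) { }
        if (host == null) host = p.url();
        int n = perHost.merge(host, 1, Integer::sum);
        if (n <= maxPerHost) fetchIn.add(p.url());
      }
      return fetchIn;
    }
  }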
5. Generate inlinks
a. MapReduce: segment/*/outlinks -> LinkDB/linksByDest.noSeg
Invert the links. The destination segment number is unknown and
-1 is used as a placeholder. Thus output is of the form:
<<destUrl, -1>, <srcUrl, anchor>* >
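Sketched as a map function (the types are illustrative):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;

  class InvertLinksSketch {
    record Link(String anchor, String toUrl) {}
    record DestKey(String destUrl, int segment) {}
    record SrcAnchor(String srcUrl, String anchor) {}

    // map: <srcUrl, outlinks> -> one <<destUrl, -1>, <srcUrl, anchor>> per link.
    static List<Map.Entry<DestKey, SrcAnchor>> invert(String srcUrl, List<Link> outlinks) {
      List<Map.Entry<DestKey, SrcAnchor>> out = new ArrayList<>();
      for (Link l : outlinks) {
        out.add(Map.entry(new DestKey(l.toUrl(), -1),
                          new SrcAnchor(srcUrl, l.anchor())));
      }
      return out;
    }
  }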
b. MapReduce: segment/*/fetchOut -> LinkDB/linksByDest.empty
No links are known here, but segment numbers are. Thus output is
of the form:
<<destUrl, segment>, <> >
c. MapReduce: LinkDB/linksByDest.* -> segment/*/inlinks
Combine the two, filling in the segment number for each destination
url and writing its inlink anchors into that segment.
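A sketch of that reduce, assuming the framework groups every
linksByDest.* record for a given destUrl into one reduce call; the
types are illustrative:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;

  class InlinksJoinSketch {
    record SrcAnchor(String srcUrl, String anchor) {}
    // A record from linksByDest.noSeg has segment == -1 and carries links;
    // a record from linksByDest.empty has a real segment and no links.
    record Entry(int segment, List<SrcAnchor> links) {}

    // reduce: all entries for one destUrl -> <destUrl's segment, inlink anchors>.
    // If no segment record exists, the destination was never fetched and the
    // links are dropped here (other policies are possible).
    static Map.Entry<Integer, List<String>> reduce(String destUrl, List<Entry> entries) {
      int segment = -1;
      List<String> anchors = new ArrayList<>();
      for (Entry e : entries) {
        if (e.segment() != -1) segment = e.segment();
        for (SrcAnchor s : e.links()) anchors.add(s.anchor());
      }
      return segment == -1 ? null : Map.entry(segment, anchors);
    }
  }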
6. Index
This uses Lucene to build indexes for the data.
a. MapReduce: segment/*/* -> indexes/*
The reduce method has all the data for each url, and writes a Lucene
index as a side-effect. The number of indexes produced is the
number of reduce tasks. The partition method determines how urls
are partitioned among indexes.
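A sketch of the reduce-side index writing, using a present-day Lucene
API for illustration; the field names and output path are illustrative:

  import java.nio.file.Paths;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.*;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.FSDirectory;

  class IndexerSketch {
    // One reduce task writes one index; urls are routed to tasks by the
    // partition function, so each url's data lands in exactly one index.
    static void indexPartition(int part, Iterable<String[]> pages) throws Exception {
      try (IndexWriter writer = new IndexWriter(
               FSDirectory.open(Paths.get("indexes/part-" + part)),
               new IndexWriterConfig(new StandardAnalyzer()))) {
        for (String[] page : pages) {            // page = { url, text, anchors }
          Document doc = new Document();
          doc.add(new StringField("url", page[0], Field.Store.YES));
          doc.add(new TextField("content", page[1], Field.Store.NO));
          doc.add(new TextField("anchor", page[2], Field.Store.NO));
          writer.addDocument(doc);
        }
      }
    }
  }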
7. Global Dedup
8. Deploy indexes
HIGH-LEVEL OPERATION
To batch crawl, steps 1-4 are iterated, followed by 5-8.
To incrementally update:
1. Generate new segment
2. Fetch segment
3. Parse segment
4. (optional) Update PageDB
5. (optional) Update LinkDB
6. Generate Inlinks for segment
7. Index segment.
8. Global Dedup
9. Deploy new index & update deletions in old indexes.
