A few weeks ago I drafted the attached document, discussing how MapReduce might be used in Nutch. This is an incomplete, exploratory document, not a final design. Most of Nutch's file formats are altered. Every operation is implemented with MapReduce. To run things on a single machine we can automatically start a job tracker and one or more task trackers, all running in the same JVM. Hopefully this will not be much slower than the current implementation running on a single machine.

Comments?

Doug
                          MapReduce in Nutch

                             Doug Cutting
                            10 March, 2005
                                DRAFT

An outline of how MapReduce might be used to implement Nutch.

Note: link analysis is not yet discussed.

DATA STRUCTURES

All data structures are NDFS directories containing <key, value> files.

1. Segment

  A segment is a set of URLs and associated information, stored
  together in a directory.  Each sub-directory (except the index)
  contains a flat file of records keyed by url.  Subdirectories
  include:

  fetchIn:    <url, <lastContentHash, lastFetchDate> >
  fetchOut:   <url, <status, fetchDate, contentHash> >
  content:    <url, <contentType, bytes, metaData> >
  text:       <url, <field, text>* >
  outlinks:   <url, <anchor, url>* >
  inlinks:    <url, <anchor>* >
  
2. PageDB

  The PageDB is used to drive crawling.  Initially it is empty.  Each
  round of fetching updates the list of known urls and their status.
  The database is a directory of flat files.

  pages:      <url, <status, contentHash, lastFetchDate, numFailures> >

3. LinkDB

  The LinkDB is used to compute anchor text and link analysis
  scores.  It contains an annotated link graph.

  linksByDest:  <<destSegment, destUrl>, <srcUrl, anchor>* >


Example:

While crawling, a system might have the following files:

   segment/0/fetchIn/part-0
            /fetchOut/part-0
                     /part-1
                     /part-2
            /content/part-0
                    /part-1
                    /part-2
   segment/1/fetchIn/part-0
                    /part-1
                    /part-2
            /fetchOut/part-0
                     /part-1
                     /part-2
            /content/part-0
                    /part-1
                    /part-2
   pageDB/pages/part-0
               /part-1
               /part-2

PROCESSING STEPS

0. Bootstrap

  Create segment/0/fetchIn from a flat file of urls.

1. Fetch

  a. MapReduce: segment/N/fetchIn -> segment/N/{0-M}/fetchIn

    Split urls by domain, so that fetching can be polite.  Map is the
    identity; partition is by the hash of the url's domain; reduce
    sorts by hash of url, spreading each domain's urls evenly through
    the file.
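
    For illustration, a minimal sketch of such a partition function
    (the class and method names here are hypothetical, not actual
    Nutch code):

      import java.net.URL;

      public class HostPartitionSketch {
        /** Assign a url to one of numPartitions splits by hashing its
         *  host, so that all urls from one host land in one split and
         *  can be fetched politely by a single task. */
        public static int partition(String url, int numPartitions) {
          String host;
          try {
            host = new URL(url).getHost();
          } catch (java.net.MalformedURLException e) {
            host = url;                  // fall back to the raw string
          }
          return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }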

  b. MapReduce: segment/N/M/fetchIn -> segment/N/{fetchOut,content}

    Map is a multithreaded fetcher; partition by URL; reduce sorts by
    URL so that output is back in canonical form.
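
    A rough sketch of the multithreaded fetching done in the map step,
    using a fixed thread pool (names and details are illustrative
    only; a real fetcher also records status, fetch date and a content
    hash):

      import java.io.InputStream;
      import java.net.HttpURLConnection;
      import java.net.URL;
      import java.util.List;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;

      public class FetcherSketch {
        /** Fetch each url with a pool of threads. */
        public static void fetchAll(List<String> urls, int threads)
            throws InterruptedException {
          ExecutorService pool = Executors.newFixedThreadPool(threads);
          for (final String url : urls) {
            pool.execute(new Runnable() {
              public void run() {
                try {
                  HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
                  conn.setConnectTimeout(10000);
                  conn.setReadTimeout(10000);
                  InputStream in = conn.getInputStream();
                  byte[] buf = new byte[4096];
                  while (in.read(buf) != -1) { }  // drain the content
                  in.close();
                } catch (Exception e) {
                  // a real fetcher would emit a failure record here
                }
              }
            });
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.HOURS);
        }
      }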

2. Parse

  a. MapReduce: segment/N/content -> segment/N/{text,outlinks}

    Parse HTML, PDF, etc., extracting text and links.

3. Update PageDB

  a. MapReduce: segment/N/fetchOut -> segment/N/fetch.pages

    Create a mini PageDB for the segment's pages.  Status is set to
    one of FETCH_SUCCESS, FETCH_FAIL_TEMPORARY, FETCH_FAIL_PERMANENT.

  b. MapReduce: segment/N/outlinks -> segment/N/outlink.pages

    Create a mini PageDB for pages linked to from the segment.  The
    reduce method can combine outlinks to the same URL, substantially
    reducing the work for the next step.  Output is sorted by the
    target urls.  Status is set to LINK.

    This is where new urls enter the system.  Urls may be filtered or
    rewritten based on configurable plugins.  By default some simple
    regex rules are used to filter out, e.g., urls with query strings
    or with image file type suffixes.
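
    As an illustration of the default filtering described above, a
    minimal regex-based filter (the patterns here are examples only,
    not Nutch's actual rules):

      import java.util.regex.Pattern;

      public class UrlFilterSketch {
        // Reject urls with query strings or common image suffixes.
        private static final Pattern QUERY = Pattern.compile("[?&=]");
        private static final Pattern IMAGE = Pattern.compile(
          "\\.(gif|jpg|jpeg|png|bmp)$", Pattern.CASE_INSENSITIVE);

        /** Return the url if it passes, or null to drop it. */
        public static String filter(String url) {
          if (QUERY.matcher(url).find()) return null;
          if (IMAGE.matcher(url).find()) return null;
          return url;
        }
      }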

  c. MapReduce: { PageDB/pages, segment/N/*.pages } -> PageDB/pages.new

    Construct a new version of the PageDB with the updated status for
    each page.  Reduce combines the information for each url and
    either assigns a status of DB_FETCHED or DB_UNFETCHED, or
    eliminates the page.  Status codes can be prioritized as follows:
       FETCH_* > LINK > DB_*
    The highest priority status determines the output status.  Thus a
    FETCH_SUCCESS status beats all other status codes and sets an
    output status of DB_FETCHED.  If a FETCH_FAIL_PERMANENT is seen or
    if FETCH_FAIL_TEMPORARY is seen and the retry count has been
    exceeded, then the url is omitted from the output.  Otherwise the
    output status is DB_UNFETCHED, perhaps with an incremented failure
    count.
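
    A sketch of this combination rule in the reduce step (the status
    constants and method are hypothetical; only the priority logic
    follows the description above):

      public class PageStatusSketch {
        // Ordered so that FETCH_* > LINK > DB_*.
        public static final int DB_UNFETCHED = 0;
        public static final int DB_FETCHED = 1;
        public static final int LINK = 2;
        public static final int FETCH_FAIL_TEMPORARY = 3;
        public static final int FETCH_FAIL_PERMANENT = 4;
        public static final int FETCH_SUCCESS = 5;

        /** Combine all statuses seen for one url into an output
         *  status, or return -1 to omit the url from the new
         *  PageDB. */
        public static int combine(int[] statuses, int failures,
                                  int maxRetries) {
          int best = -1;
          for (int i = 0; i < statuses.length; i++)
            if (statuses[i] > best) best = statuses[i];

          if (best == FETCH_SUCCESS) return DB_FETCHED;
          if (best == FETCH_FAIL_PERMANENT) return -1;        // omit
          if (best == FETCH_FAIL_TEMPORARY && failures + 1 > maxRetries)
            return -1;                          // retries exhausted
          if (best == DB_FETCHED) return DB_FETCHED;
          return DB_UNFETCHED;  // perhaps with an incremented failure count
        }
      }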
  
  d. rm PageDB/pages; mv PageDB/pages.new PageDB/pages

    Install the new version of the DB.

4. Generate new segment

  a. MapReduce: PageDB/pages -> segment/N/fetchIn

    Initialize a new segment with a fetchIn generated from the
    PageDB.  Various criteria can be used, such as limiting urls by
    domain, incorporating link analysis to prioritize urls, or
    refetching high-priority pages.  In the simplest case this just
    emits all of the DB_UNFETCHED pages.
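
    In that simplest case the map step reduces to a filter, roughly
    (the constant and method names are hypothetical):

      public class GeneratorSketch {
        public static final int DB_UNFETCHED = 0;  // hypothetical code

        /** Emit a page into the new fetchIn only if it is unfetched. */
        public static boolean select(String url, int status) {
          return status == DB_UNFETCHED;
        }
      }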

5. Generate inlinks

  a. MapReduce: segment/*/outlinks -> LinkDB/linksByDest.noSeg

    Invert the links.  The destination segment number is unknown and
    -1 is used as a placeholder.  Thus output is of the form:

       <<destUrl, -1>, <srcUrl, anchor>* >
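
    A sketch of the inverting map step, which emits one record per
    outlink, keyed by the link's destination (the record types here
    are stand-ins for Nutch's actual formats):

      import java.util.ArrayList;
      import java.util.List;

      public class InvertLinksSketch {
        /** One outlink: a destination url and its anchor text. */
        public static class Link {
          public final String url, anchor;
          public Link(String url, String anchor) {
            this.url = url;
            this.anchor = anchor;
          }
        }

        /** For one source page, emit <<destUrl, -1>, <srcUrl, anchor>>
         *  per outlink; -1 marks the unknown destination segment. */
        public static List<String[]> invert(String srcUrl,
                                            List<Link> outlinks) {
          List<String[]> records = new ArrayList<String[]>();
          for (Link link : outlinks)
            records.add(new String[] {link.url, "-1", srcUrl, link.anchor});
          return records;
        }
      }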


  b. MapReduce: segment/*/fetchOut -> LinkDB/linksByDest.empty

    No links are known here, but segment numbers are.  Thus output is
    of the form:

       <<destUrl, segment>, <> >

  c. MapReduce: LinkDB/linksByDest.* -> segment/*/inlinks

6. Index

  This uses Lucene to build indexes for the data.

  a. MapReduce: segment/*/* -> indexes/*

  The reduce method has all the data for each url, and writes a Lucene
  index as a side-effect.  The number of indexes produced is the
  number of reduce tasks.  The partition method determines how urls
  are partitioned among indexes.
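
  For illustration, a sketch of the reduce side using a Lucene 1.4
  style API (field names and record layout here are assumptions):

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexerSketch {
      /** Write one Lucene index per reduce task, as a side effect.
       *  Each row of pages is {url, text, anchors}. */
      public static void index(String indexDir, String[][] pages)
          throws IOException {
        IndexWriter writer =
          new IndexWriter(indexDir, new StandardAnalyzer(), true);
        for (int i = 0; i < pages.length; i++) {
          Document doc = new Document();
          doc.add(Field.Keyword("url", pages[i][0]));   // stored, untokenized
          doc.add(Field.UnStored("content", pages[i][1])); // indexed, not stored
          doc.add(Field.UnStored("anchor", pages[i][2]));
          writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
      }
    }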

7. Global Dedup

8. Deploy indexes


HIGH-LEVEL OPERATION

To batch crawl, steps 1-4 are iterated, followed by 5-8.

To incrementally update:
  1. Generate new segment
  2. Fetch segment
  3. Parse segment
  4. (optional) Update PageDB 
  5. (optional) Update LinkDB
  6. Generate Inlinks for segment
  7. Index segment.
  8. Global Dedup
  9. Deploy new index & update deletions in old indexes.
