Re: [Nutch-dev] [Fwd: Fetch list priority]

Doug Cutting Wed, 19 Oct 2005 10:50:03 -0700

Massimo Miccoli wrote:

Any news about the integration of OPIC in mapred? I have time to develop OPIC on Nutch mapred. Can you help me get started? From the email by Carlos Alberto-Alejandro CASTILLO-Ocaranza, it seems that the best way to integrate OPIC is on the old webdb. Is this approach also valid for CrawlDb in mapred?


Yes.  I think the way to implement this in the mapred branch is:

1. In CrawlDatum.java, replace 'int linkCount' with 'float score'. The default value of this should be 1.0f. This will require changes to accessors, write, readFields, compareTo etc. A constructor which specifies the score should be added. The comparator should sort by decreasing score.
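
A minimal sketch of what step 1 might look like, under stated assumptions: ScoredDatum is a hypothetical stand-in for CrawlDatum that keeps only the score field, and sortedScores is a helper added purely for illustration. The real class has more fields and its own comparator.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for CrawlDatum; only the score-related parts are shown.
class ScoredDatum implements Comparable<ScoredDatum> {
    private float score = 1.0f; // default score suggested in step 1

    ScoredDatum() {}

    // Constructor which specifies the score.
    ScoredDatum(float score) { this.score = score; }

    float getScore() { return score; }
    void setScore(float score) { this.score = score; }

    // Serialize the score as a 4-byte float.
    void write(DataOutput out) throws IOException { out.writeFloat(score); }

    // Read the score back in the same format.
    void readFields(DataInput in) throws IOException { score = in.readFloat(); }

    // Higher scores sort first, i.e. decreasing order.
    @Override
    public int compareTo(ScoredDatum that) {
        return Float.compare(that.score, this.score);
    }

    // Illustration only: sort some scores and return them in order.
    static List<Float> sortedScores(float... scores) {
        List<ScoredDatum> data = new ArrayList<>();
        for (float s : scores) data.add(new ScoredDatum(s));
        Collections.sort(data);
        List<Float> out = new ArrayList<>();
        for (ScoredDatum d : data) out.add(d.getScore());
        return out;
    }
}
```

Swapping the operands in Float.compare is what gives the decreasing order, so the highest-scoring entries come first in a sorted fetch list.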


2. In crawl/Fetcher.java, add the score to the Content's metadata:

  public static final String SCORE_KEY = "org.apache.nutch.crawl.score";
  ...
  private void output(...) {
    ...
    content.getMetadata().setProperty(SCORE_KEY,
                                      Float.toString(datum.getScore()));
    ...
  }

3. In ParseOutputFormat.java, when writing the CrawlDatum for each outlink (line 77), set the score of the link CrawlDatum to be the score of the page:


   float score =
     Float.valueOf(parse.getData().get(Fetcher.SCORE_KEY));
   score /= links.length;
   for (int i = 0; i < links.length; ...) {
     ...
       new CrawlDatum(CrawlDatum.STATUS_LINKED,
                      interval, score);
     ...
   }
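
The division in step 3 can be sketched on its own as follows. This is only an illustration under assumptions: java.util.Properties stands in for the content metadata, OutlinkScoreSketch and perLinkScore are hypothetical names, and both the fallback of "1.0" for a missing score and the zero returned for a page without outlinks are choices made for this sketch.

```java
import java.util.Properties;

// Hypothetical helper; not part of ParseOutputFormat.
class OutlinkScoreSketch {
    static final String SCORE_KEY = "org.apache.nutch.crawl.score";

    // Returns the share of the page score that each outlink receives.
    static float perLinkScore(Properties metadata, int outlinkCount) {
        // Assumption: a page with no outlinks passes on nothing. Without this
        // guard the float division below would yield Infinity.
        if (outlinkCount == 0) {
            return 0.0f;
        }
        // The score travels through the metadata as a string, so parse it back.
        // Assumption: fall back to 1.0 when the key is missing.
        float pageScore = Float.parseFloat(metadata.getProperty(SCORE_KEY, "1.0"));
        return pageScore / outlinkCount;
    }
}
```

For example, a page with score 2.0 and four outlinks would give each link 0.5.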

4. In CrawlDbReducer.java, remove linkCount calculations. Replace these with something like:


  float scoreIncrement = 0.0f;
  while (values.next()) {
    ...
    switch (datum.getStatus()) {
    ...
    case CrawlDatum.STATUS_LINKED:
      scoreIncrement += datum.getScore();
      break;
    ...
    }
  }
  ...
  result.setScore(result.getScore() + scoreIncrement);
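
To make the accumulation in step 4 concrete, here is a self-contained sketch. The names ScoreReduceSketch, Datum and reduce, and the two status codes, are hypothetical stand-ins rather than the real CrawlDbReducer API. It also assumes that a URL with no existing db entry starts from the default score of 1.0f.

```java
import java.util.List;

// Hypothetical stand-in for the score handling in CrawlDbReducer.
class ScoreReduceSketch {
    static final byte STATUS_DB = 1;     // assumed code for an existing db entry
    static final byte STATUS_LINKED = 2; // assumed code for an incoming link

    static final class Datum {
        final byte status;
        final float score;

        Datum(byte status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    // Combines all values seen for one URL into its new score.
    static float reduce(List<Datum> values) {
        float existing = 1.0f; // assumption: default for a URL not yet in the db
        float scoreIncrement = 0.0f;
        for (Datum datum : values) {
            switch (datum.status) {
                case STATUS_LINKED:
                    // Each incoming link contributes its share of the source page score.
                    scoreIncrement += datum.score;
                    break;
                default:
                    // Any other entry carries the score already stored for the URL.
                    existing = datum.score;
                    break;
            }
        }
        return existing + scoreIncrement;
    }
}
```

With an existing entry of score 1.0 and two incoming links carrying 0.5 and 0.25, the new score would be 1.75.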

I think that should do it, no?

Doug
