Massimo Miccoli wrote:
Any news about the integration of OPIC in mapred? I have time to develop OPIC on Nutch Mapred. Can you help me get started? From the email by Carlos Alberto-Alejandro CASTILLO-Ocaranza, it seems that the best way to integrate OPIC is on the old webdb. Is this approach also valid for
CrawlDb in Mapred?

Yes.  I think the way to implement this in the mapred branch is:

1. In CrawlDatum.java, replace 'int linkCount' with 'float score'. The default value of this should be 1.0f. This will require changes to accessors, write, readFields, compareTo etc. A constructor which specifies the score should be added. The comparator should sort by decreasing score.
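To make step 1 concrete, here is a minimal sketch of the score-related parts only. ScoredDatum is a hypothetical stand-in, not the real CrawlDatum; the field layout and the serialization order are assumptions made for illustration.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical stand-in for CrawlDatum; only score-related members are shown.
class ScoredDatum implements Comparable<ScoredDatum> {
  private byte status;
  private float score = 1.0f;   // default score, as suggested in step 1

  ScoredDatum() {}

  // Constructor that specifies the score.
  ScoredDatum(byte status, float score) {
    this.status = status;
    this.score = score;
  }

  byte getStatus() { return status; }
  float getScore() { return score; }
  void setScore(float score) { this.score = score; }

  // Write the score where linkCount used to be written (assumed layout).
  void write(DataOutput out) throws IOException {
    out.writeByte(status);
    out.writeFloat(score);
  }

  // Read the fields back in the same order they were written.
  void readFields(DataInput in) throws IOException {
    status = in.readByte();
    score = in.readFloat();
  }

  // Higher scores sort first, i.e. decreasing score order.
  @Override
  public int compareTo(ScoredDatum that) {
    return Float.compare(that.score, this.score);
  }
}
```

Swapping the arguments to Float.compare is what gives the decreasing order, so a sorted list puts the highest-scoring entries first.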

2. In crawl/Fetcher.java, add the score to the Content's metadata:

  public static final String SCORE_KEY = "org.apache.nutch.crawl.score";
  ...
  private void output(...) {
    ...
    content.getMetadata().setProperty(SCORE_KEY, Float.toString(datum.getScore()));
    ...
  }


3. In ParseOutputFormat.java, when writing the CrawlDatum for each outlink (line 77), set the score of the link CrawlDatum to be the score of the page divided by the number of outlinks:

   float score =
     Float.parseFloat(parse.getData().get(Fetcher.SCORE_KEY));
   score /= links.length;
   for (int i = 0; i < links.length; ...) {
     ...
       new CrawlDatum(CrawlDatum.STATUS_LINKED,
                      interval, score);
     ...
   }
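As a small worked example of the split in step 3, the sketch below parses a score from its string form and divides it evenly among the outlinks. OutlinkScores and perLinkScore are hypothetical names used only for illustration, and the zero-outlink guard is an added assumption.

```java
// Hypothetical helper illustrating how a page's score is shared by its outlinks.
class OutlinkScores {
  // Parses the page score from its metadata string and splits it evenly.
  static float perLinkScore(String pageScore, int outlinkCount) {
    if (outlinkCount == 0) {
      return 0.0f;              // no outlinks, so nothing to distribute
    }
    return Float.parseFloat(pageScore) / outlinkCount;
  }
}
```

For instance, a page with score 1.0 and four outlinks passes 0.25 to each linked page.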

4. In CrawlDbReducer.java, remove linkCount calculations. Replace these with something like:

  float scoreIncrement = 0.0f;
  while (values.next()) {
    ...
    switch (datum.getStatus()) {
    ...
    case CrawlDatum.STATUS_LINKED:
      scoreIncrement += datum.getScore();
      break;
    ...
    }
  }
  ...
  result.setScore(result.getScore() + scoreIncrement);
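The accumulation in step 4 can be sketched in isolation as below. ScoreMerge, Entry, and the two status constants are hypothetical; the real constants live in CrawlDatum, and the real reducer handles more statuses than this sketch does.

```java
import java.util.List;

// Hypothetical, simplified model of the reducer's score accumulation.
class ScoreMerge {
  static final int STATUS_DB = 0;      // assumed: the existing db entry
  static final int STATUS_LINKED = 1;  // assumed: a contribution from an inlink

  // Minimal (status, score) pair standing in for a CrawlDatum value.
  static class Entry {
    final int status;
    final float score;
    Entry(int status, float score) {
      this.status = status;
      this.score = score;
    }
  }

  // Sums the scores of all linked entries and adds them to the existing score.
  static float mergedScore(List<Entry> values) {
    float result = 0.0f;
    float scoreIncrement = 0.0f;
    for (Entry e : values) {
      switch (e.status) {
        case STATUS_LINKED:
          scoreIncrement += e.score;   // one inlink's share
          break;
        default:
          result = e.score;            // the page's current score
          break;
      }
    }
    return result + scoreIncrement;
  }
}
```

With an existing score of 1.0 and two inlink contributions of 0.25 and 0.5, the merged score is 1.75.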

I think that should do it, no?

Doug
