Massimo Miccoli wrote:
Any news about integration of OPIC in mapred? I have time to develop
OPIC on Nutch Mapred. Can you help me to start?
According to the email from Carlos Alberto-Alejandro CASTILLO-Ocaranza, it
seems that the best way to integrate OPIC is on the old webdb. Is this
approach also valid for CrawlDb in Mapred?
Yes. I think the way to implement this in the mapred branch is:
1. In CrawlDatum.java, replace 'int linkCount' with 'float score'. The
default value of this should be 1.0f. This will require changes to
accessors, write, readFields, compareTo etc. A constructor which
specifies the score should be added. The comparator should sort by
decreasing score.
2. In crawl/Fetcher.java, add the score to the Content's metadata:
  public static final String SCORE_KEY = "org.apache.nutch.crawl.score";
  ...
  private void output(...) {
    ...
    // setProperty takes a String value, so convert the float score
    content.getMetadata().setProperty(SCORE_KEY,
                                      Float.toString(datum.getScore()));
    ...
  }
3. In ParseOutputFormat.java, when writing the CrawlDatum for each
outlink (line 77), set the score of the link CrawlDatum to be the score
of the page divided by the number of outlinks:
  float score =
    Float.parseFloat(parse.getData().get(Fetcher.SCORE_KEY));
  score /= links.length;
  for (int i = 0; i < links.length; ...) {
    ...
    new CrawlDatum(CrawlDatum.STATUS_LINKED,
                   interval, score);
    ...
  }
4. In CrawlDbReducer.java, remove linkCount calculations. Replace these
with something like:
  float scoreIncrement = 0.0f;
  while (values.next()) {
    ...
    switch (datum.getStatus()) {
    ...
    case CrawlDatum.STATUS_LINKED:
      scoreIncrement += datum.getScore();
      break;
    ...
    }
    ...
  }
  ...
  result.setScore(result.getScore() + scoreIncrement);
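
For illustration only, here is a stand-alone sketch of how the score would
flow through steps 1, 3 and 4. It does not use any Nutch classes:
OpicScoreSketch, Datum, distribute and reduce are hypothetical names that
stand in for CrawlDatum, ParseOutputFormat and CrawlDbReducer, and it assumes
a recent JDK rather than the one the mapred branch builds with.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-alone sketch; none of these names are Nutch classes.
class OpicScoreSketch {

  // Simplified stand-in for CrawlDatum: a URL plus a float score.
  static final class Datum implements Comparable<Datum> {
    final String url;
    final float score;

    Datum(String url) { this(url, 1.0f); }  // step 1: default score is 1.0f

    Datum(String url, float score) {
      this.url = url;
      this.score = score;
    }

    // Step 1: order by decreasing score.
    @Override
    public int compareTo(Datum other) {
      return Float.compare(other.score, this.score);
    }
  }

  // Step 3: split a page's score evenly across its outlinks.
  static List<Datum> distribute(float pageScore, String[] links) {
    List<Datum> out = new ArrayList<>();
    if (links.length == 0) {
      return out;  // no outlinks, nothing to hand on
    }
    float share = pageScore / links.length;
    for (String link : links) {
      out.add(new Datum(link, share));
    }
    return out;
  }

  // Step 4: add the contributions from incoming links to the current score.
  static float reduce(float currentScore, float[] linkedScores) {
    float scoreIncrement = 0.0f;
    for (float s : linkedScores) {
      scoreIncrement += s;
    }
    return currentScore + scoreIncrement;
  }

  public static void main(String[] args) {
    List<Datum> linked = distribute(1.0f, new String[] {"a", "b", "c", "d"});
    System.out.println(linked.get(0).score);
    System.out.println(reduce(1.0f, new float[] {0.25f, 0.25f}));

    List<Datum> db = new ArrayList<>();
    db.add(new Datum("low", 0.5f));
    db.add(new Datum("high", 1.5f));
    db.add(new Datum("default"));
    Collections.sort(db);  // highest score first
    for (Datum d : db) {
      System.out.println(d.url + " " + d.score);
    }
  }
}
```

With a page score of 1.0 and four outlinks, each link datum carries 0.25, and
a page at 1.0 that receives two such links ends up at 1.5. The empty-outlinks
check is an addition of this sketch, not something the steps above require.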
I think that should do it, no?
Doug