I want to modify Nutch to increase the score of some pages in their CrawlDatum. The goal is to mark the pages that contain a certain token. Raising their score to a high value should get them selected again in the next segment generation.
I modified Fetcher.java like this:

    (...)
    case ProtocolStatus.SUCCESS:    // got a page
      content.setContent((Integer.toString(0)).getBytes());
      if (tokenIncluded) {
        // boost pages that contain the token
        float score = 500.0f;
        fit.datum.setScore(score);
      }
    (...)

After fetching, when the crawldb has to be updated, the inputs to the MapReduce update process are the fetched URLs. Each fetched URL appears twice, like this:

    http://bardeportes.blogspot.com/
    Version: 7
    Status: 2 (db_fetched)
    Fetch time: Thu Mar 11 11:06:09 CET 2010
    Modified time: Thu Jan 01 01:00:00 CET 1970
    Retries since fetch: 0
    Retry interval: 0 seconds (0 days)
    Score: 1.0
    Signature: null
    Metadata: _pst_: success(1), lastModified=0

    ...

    http://bardeportes.blogspot.com/
    Version: 7
    Status: 33 (fetch_success)
    Fetch time: Thu Mar 11 11:06:17 CET 2010
    Modified time: Thu Jan 01 01:00:00 CET 1970
    Retries since fetch: 0
    Retry interval: 0 seconds (0 days)
    Score: 500.0
    Signature: null
    Metadata: _ngt_: 1268301973666_pst_: success(1), lastModified=0

The final score written to the crawldb is always 1.0, never 500.0.

Any idea? (I hope someone understands the issue.)

--
View this message in context: http://old.nabble.com/Increasing-the-score-of-especific-pages-tp27861656p27861656.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
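
To make the symptom concrete, here is a minimal sketch of what the two dump entries suggest. It assumes the update step builds the final record from the existing db_fetched entry and ignores the score carried by the fetch_success entry; that is a guess, not something the dump confirms. The class Entry and both merge methods are hypothetical names for illustration only, not Nutch classes or Nutch's actual update code.

```java
// Toy model only: Entry and both merge methods are hypothetical stand-ins,
// not Nutch classes and not Nutch's real crawldb update logic.
final class ScoreMergeSketch {

    /** Hypothetical, simplified stand-in for a crawldb record. */
    static final class Entry {
        final String status;
        final float score;

        Entry(String status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    /**
     * Assumed behaviour: the merged record takes its score from the existing
     * db entry, so the score set at fetch time (fetchEntry) is never read.
     */
    static Entry mergeKeepingDbScore(Entry dbEntry, Entry fetchEntry) {
        return new Entry("db_fetched", dbEntry.score);
    }

    /** Alternative for comparison: keep whichever score is higher. */
    static Entry mergePreferringHigherScore(Entry dbEntry, Entry fetchEntry) {
        return new Entry("db_fetched", Math.max(dbEntry.score, fetchEntry.score));
    }

    public static void main(String[] args) {
        // The two entries seen for the same URL in the update input.
        Entry dbEntry = new Entry("db_fetched", 1.0f);
        Entry fetchEntry = new Entry("fetch_success", 500.0f);

        System.out.println(mergeKeepingDbScore(dbEntry, fetchEntry).score);        // prints 1.0
        System.out.println(mergePreferringHigherScore(dbEntry, fetchEntry).score); // prints 500.0
    }
}
```

If that assumption holds, a score set only in Fetcher.java would be discarded at update time, which would match the 1.0 observed in the crawldb.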