Talat UYARER created NUTCH-2003:
-----------------------------------
Summary: topN is not work correctly
Key: NUTCH-2003
URL: https://issues.apache.org/jira/browse/NUTCH-2003
Project: Nutch
Issue Type: Bug
Affects Versions: 2.3
Reporter: Talat UYARER
Priority: Minor
I want to crawl the top 1000 URLs from the webpage table, ordered by score.
This does not work correctly.
When I use the topN parameter, its value is divided by the number of map tasks
(topN / maptaskcounts = maptasktopN), and every map task generates only its own
maptasktopN highest-scored URLs. For example, assume I have 25 map tasks and
set topN to 1000, so maptasktopN is calculated as 40. As a result, we do not
get the 1000 highest-scored URLs overall; we get 1000 URLs made up of the 40
highest-scored URLs from each of the 25 map tasks.
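
The difference can be illustrated with a small sketch. This is not Nutch code:
the function names, the example URLs, and the score distribution are all
hypothetical, and the only thing taken from the report is the arithmetic
(25 map tasks, topN = 1000, 40 URLs per task). It assumes the scores are skewed
so that a few partitions hold most of the high-scored URLs.

```python
import heapq


def global_top_n(partitions, top_n):
    """Select the top_n highest-scored rows across all partitions."""
    all_rows = [row for part in partitions for row in part]
    return heapq.nlargest(top_n, all_rows, key=lambda row: row[1])


def per_task_top_n(partitions, top_n):
    """Mimic the reported behaviour: each map task keeps top_n // task_count rows."""
    limit = top_n // len(partitions)  # 1000 // 25 == 40
    selected = []
    for part in partitions:
        selected.extend(heapq.nlargest(limit, part, key=lambda row: row[1]))
    return selected


# Hypothetical data: 25 partitions of 100 (url, score) rows each.
# Partition 0 holds the highest scores, partition 24 the lowest.
partitions = [
    [(f"http://example.org/p{p}/u{i}", (25 - p) * 1000 + i) for i in range(100)]
    for p in range(25)
]

expected = global_top_n(partitions, 1000)
actual = per_task_top_n(partitions, 1000)
overlap = set(expected) & set(actual)

print(len(expected), len(actual), len(overlap))
```

With this skewed data both selections contain 1000 URLs, but only 400 of them
are the same: the per-task selection takes 40 URLs from every partition,
including the low-scored ones, while the true top 1000 come entirely from the
ten highest-scored partitions.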
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)