Markus Jelsma created NUTCH-2368:
------------------------------------
Summary: Variable generate.max.count and fetcher.server.delay
Key: NUTCH-2368
URL: https://issues.apache.org/jira/browse/NUTCH-2368
Project: Nutch
Issue Type: Improvement
Components: generator
Affects Versions: 1.12
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.13
Attachments: NUTCH-2368.patch
In some cases we need to use host-specific characteristics to determine crawl
speed and bulk sizes, because with our (Openindex) settings we can only just
recrawl hosts with up to 800k URLs.
This patch solves the problem by introducing the HostDB to the Generator and
providing powerful Jexl expressions. Check these two expressions added to the
Generator:
{code}
-Dgenerate.max.count.expr='
if (unfetched + fetched > 800000) {
  return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1);
} else {
  return conf.getDouble("generate.max.count", 300);
}'
-Dgenerate.fetch.delay.expr='
if (unfetched + fetched > 800000) {
  return (pct95._rs_ + 500) * 1000;
} else {
  return conf.getDouble("fetcher.server.delay", 1000);
}'
{code}
For each large host: select as many records as can be fetched within the fetch
time limit, based on the number of threads and the 95th percentile response
time. Or: queueMaxCount = (timelimit / responsetime) * numThreads.
The second expression simply follows up on that by setting the crawlDelay of
the fetch queue.
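
To make the arithmetic of the first expression concrete, here is a minimal
Python sketch of the same calculation. It is not part of the patch. The
function and parameter names are hypothetical, and it assumes that
pct95._rs_ is a response time in milliseconds (the expression divides it by
1000, which suggests this). It also uses floating-point division throughout,
which may differ from how Jexl evaluates the original expression.

```python
def queue_max_count(timelimit_mins, pct95_rs_ms, threads_per_queue, padding_ms=500):
    """Estimate how many URLs one host queue can fetch within the time limit.

    Mirrors queueMaxCount = (timelimit / responsetime) * numThreads.
    pct95_rs_ms is assumed to be in milliseconds; padding_ms is the
    constant 500 added in the expression.
    """
    budget_secs = timelimit_mins * 60
    secs_per_fetch = (pct95_rs_ms + padding_ms) / 1000.0
    return int(budget_secs / secs_per_fetch * threads_per_queue)


def generate_max_count(unfetched, fetched, pct95_rs_ms,
                       timelimit_mins=12, threads_per_queue=1,
                       default_max_count=300, large_host_threshold=800000):
    """Pick the per-host limit: computed for large hosts, default otherwise."""
    if unfetched + fetched > large_host_threshold:
        return queue_max_count(timelimit_mins, pct95_rs_ms, threads_per_queue)
    return default_max_count


if __name__ == "__main__":
    # Large host: 12 minutes = 720 s, (1500 + 500) ms = 2 s per fetch, 1 thread.
    print(generate_max_count(500000, 400000, pct95_rs_ms=1500))
    # Small host: falls back to the default limit.
    print(generate_max_count(100, 200, pct95_rs_ms=1500))
```

With a 12 minute time limit, one thread per queue and a padded response time
of 2 seconds, a large host gets 720 / 2 = 360 records. Hosts below the
threshold keep the default of 300.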